THE CYCLIC EFFECTS OF LINEAR GRADUATIONS PERSISTING IN
THE DIFFERENCES OF THE GRADUATED VALUES 


By Edward L. Dodd


University of Texas


1. Scope of inquiry. Slutzky [1] applied the moving sum, the repeated
moving sum, and other linear processes to random numbers obtained from
lottery drawings. But the graph of the moving sum becomes, when the vertical
scale is changed in the ratio of $n$ to 1, the graph of the moving average, the simplest
form of graduation. When cyclic effects are studied, there is no essential difference
between a moving sum and a moving average, nor between a general linear
process with coefficients $a_1, a_2, \ldots, a_n$, having sum $A \neq 0$, and the
corresponding graduation, with coefficients $a_i' = a_i/A$. Thus Slutzky's work throws
considerable light upon graduation, although his main interest was in summation.

Slutzky found that the graphs of moving sums of random numbers bore 
strong resemblance to graphs of economic phenomena, such as [1, p. 110] that 
of English business cycles from 1855 to 1877. In fact, Slutzky regards the 
fluctuations in economic phenomena as due largely to a synthesizing of random 
causes. 

In general the undulatory character of such values cannot be described as
periodic, since the waves are of different length. But Slutzky found that, upon
operating on random data having mean zero and constant variance, the resulting
values approach a sinusoidal limit under certain conditions; in particular, when
a set of $n$ summations by twos is followed by $m$ differencings, and as $n \to \infty$,
$m/n \to$ a constant. Romanovsky [2] generalized this result by taking successive
summations of $s$ consecutive elements of the data, with $s \geq 2$; but required that
$m/n \to \alpha \neq 1$. However, the cases which are of interest to me just now are
those for which $m = n - 1$ or $m = n - 2$; and for these cases $m/n \to 1$.
Romanovsky considers the case of $m = n - 1$ (not, however, as leading to a
sinusoidal limit) and gives in formula (46) the value of a coefficient of
correlation, which I deduce directly. From his formula (43) a corresponding
coefficient of correlation can be obtained for the case of $m = n - 2$, as the sum of
certain products. I need a simpler expression than this, which I obtain
directly. In my treatment, these coefficients are the cosines of angles; and the
ratio of such an angle to a whole revolution is an expected frequency of
occurrence.

After setting forth in Section 2 some preliminary formulas, I treat in Section 3 
the results of applying to random data an indefinite number k + 2 of summa- 
tions or averagings, followed by k differencings—the number of terms in a sum 
remaining fixed. In Section 4, however, only a few differencings are applied to a 
graduation. In particular the Spencer 21-term formula is studied in some 
detail. In former papers [3, 4] I have dealt with the immediate effects of 
graduations upon random data. 

The question to be considered in this paper is this: Do the cyclic effects appearing
in the graduated values persist in the successive differences? And, if so, do
these effects fade out gradually or, on the other hand, do they come to a rather abrupt
termination?

These differences of graduated values, indeed, up to the third, fourth or fifth 
are of considerable importance. Henderson [5] defines the smoothing coefficient 
of a given graduation as the ratio of the theoretical standard deviation of the 
third differences for the graduated values to that for the original values or data. 





2. Preliminary notions and formulas. The data to be graduated will be supposed
to be independent, or uncorrelated, or as Slutzky expresses it, "incoherent."
This will imply that the expected value of the product of two different
chance variates is the product of their expected values.

Now the operations of summing and differencing as used here are not inverse. 


To illustrate: Given as independent $u, v, w, x, y, z, \ldots$. Summing by twos
yields the sequence $u + v, v + w, w + x, x + y, y + z, \ldots$. But the first
differences of these numbers, $w - u, x - v, y - w, z - x, \ldots$ are alternately
correlated; thus $w - u$ is negatively correlated with $y - w$; $x - v$ with $z - x$,
etc. Indeed, successive differencing following successive summing does not lead
back to the original condition of incoherency. However, under certain conditions,
the resulting coherency may be so slight that the final succession of numbers
may have just about the same chaotic properties as the succession of data.
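
The illustration above can be checked with a small sketch. It assumes the data are independent with zero mean and a common variance, so that the correlation between two linear combinations is fixed by their coefficients alone; the helper names `convolve` and `lag_correlation` are illustrative and not taken from the paper.

```python
def convolve(a, b):
    """Coefficients of the linear process obtained by applying a, then b."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lag_correlation(coeffs, lag):
    """Correlation between processed values `lag` steps apart,
    for independent zero-mean data of constant variance."""
    num = sum(coeffs[r] * coeffs[r + lag] for r in range(len(coeffs) - lag))
    den = sum(c * c for c in coeffs)
    return num / den


sum_by_twos = [1, 1]        # u + v, v + w, ...
first_difference = [-1, 1]  # later value minus earlier value
combined = convolve(sum_by_twos, first_difference)  # w - u, x - v, ...

adjacent = lag_correlation(combined, 1)   # w - u with x - v
alternate = lag_correlation(combined, 2)  # w - u with y - w
```

Here `combined` has the coefficients -1, 0, 1: adjacent values are uncorrelated, while alternate values have correlation -1/2, which is the negative correlation described in the text.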
In my paper [3, p. 262], I set forth a number of features on the basis of which 
a cycle length could be defined. One of these involves the frequency of maxima. 
Given independent chance variables, each subject to the same law of distribution,

(1) $P(x_i \leq x) = \Phi(x)$,

where $\Phi(x)$ has a derivative $\phi(x)$. It is then easy to see that the expected
relative frequency of maxima is 1/3. That is:

(2) $P(x_{i-1} \leq x_i \geq x_{i+1}) = \int_{-\infty}^{\infty} [\Phi(x)]^2 \phi(x)\, dx = 1/3.$


Now, for a given feature, a cycle length is defined as the reciprocal of the theoretic
relative frequency. Then the cycle length here for maxima is three. It is well
known that averaging tends to remove maxima. Thus, upon averaging or
summing, the cycle length tends to increase. It is almost as well known that
differencing tends to increase the frequency of maxima, and thus decrease cycle
length. For if $z_i = \Delta y_i = y_{i+1} - y_i$, then between two maxima of $y_i$, there is
at least one minimum (strong or weak) of $y_i$; and following this minimum and
before passing the next maximum of $y_i$ there is at least one maximum of $z_i$.
Successive differencing tends to reduce the cycle length of maxima from 3 to 2,
that is, to make the graph a perfect zig-zag where positive and negative values
of $z_i$ alternate. A set of differencings following a set of summings may bring
the cycle length from some fairly large number back to about 3, and thus restore
something like the original chaotic appearance in the graph.
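
The value 1/3 in (2) can also be seen by counting, without any integration. The sketch below assumes a continuous distribution, so that ties have probability zero and the 3! orderings of three consecutive independent values are equally likely; the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def maxima_frequency():
    """Fraction of the equally likely orderings of three consecutive
    values in which the middle value is the largest."""
    orders = list(permutations(range(3)))
    hits = sum(1 for left, mid, right in orders if mid > left and mid > right)
    return Fraction(hits, len(orders))


freq = maxima_frequency()
cycle_length = 1 / freq  # reciprocal of the relative frequency
```

Two of the six orderings put the largest value in the middle, giving the frequency 1/3 and the cycle length 3 quoted for maxima.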

In dealing with the foregoing $\Phi(x)$ or $\phi(x)$ in (2), it was not assumed that the
distribution be normal. But, in what follows, it will be assumed that

(3) $\phi(x) = \dfrac{1}{\sigma(2\pi)^{1/2}}\, e^{-(x-\mu)^2/2\sigma^2},$

and, for convenience, $\mu$ will be taken as zero; that is, the data will be supposed
given as deviations from their theoretic mean. Actually, the data used by
Slutzky and the data I have used belong to a rectangular distribution, as noted
in my former paper. Nevertheless the close agreement between actual and expected
results seems to indicate [3, p. 263] that the theory is in general applicable.
It is well known that averaging of observations from non-normal distributions
may lead rather quickly to an approximately normal distribution.

Given $n$ real numbers, $a_1, a_2, \ldots, a_n$, let

(4) $y_j = a_1 x_i + a_2 x_{i+1} + \cdots + a_n x_{i+n-1}; \quad i = 1, 2, 3, \ldots.$


Then $y_j$ is the moving sum if each $a_r = 1$. Slutzky takes $j = i$ or $j = i + n - 1$.
Again, $y_j$ is the moving average if each $a_r = 1/n$. For graduation in general,
the condition $\sum a_r = 1$ is imposed; and usually $j = i + (n - 1)/2$. If $n$ is odd,
$y_j$ is thus associated with the middle $x$.

Under the assumption that the $x$'s are independent and normally distributed
about mean zero, with constant variance, I have proven [3, p. 256]: The probability
that for any specified $j$, $y_{j-1} < 0$ and $y_j > 0$ is given by $P = \theta/360°$,
where

(5) $\cos\theta = \sum_{r=1}^{n-1} a_r a_{r+1} \Big/ \sum_{r=1}^{n} a_r^2.$

The expected relative frequency of up-crossings of the graph of the $y$'s through
the zero base line is then $\theta/360°$. That is: $\theta/360°$ is the expected relative
frequency of a change in the sign of $y$ from $-$ to $+$; also, of a change in sign
from $+$ to $-$.
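
Formula (5) is straightforward to evaluate for any set of coefficients. The following is a minimal sketch under the assumptions stated in the text (independent normal data, mean zero, constant variance); the function names are illustrative.

```python
import math


def cos_theta(coeffs):
    """Right member of (5): sum of products of adjacent coefficients
    divided by the sum of squared coefficients."""
    num = sum(a * b for a, b in zip(coeffs, coeffs[1:]))
    den = sum(a * a for a in coeffs)
    return num / den


def upcrossing_frequency(coeffs):
    """Expected relative frequency theta/360 of a change of sign
    from minus to plus."""
    theta_degrees = math.degrees(math.acos(cos_theta(coeffs)))
    return theta_degrees / 360.0


raw = upcrossing_frequency([1])             # identical "graduation"
moving_sum_5 = upcrossing_frequency([1] * 5)  # moving sum of 5 terms
```

For the raw data the frequency is 1/4; for a moving sum of five terms, $\cos\theta = 4/5$ and the frequency of up-crossings drops to about 0.10, which illustrates how summing lengthens the cycle.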

But, as $\Delta y_j = y_{j+1} - y_j$, it follows that

(6) $\Delta y_j = b_1 x_i + b_2 x_{i+1} + \cdots + b_n x_{i+n-1} + b_{n+1} x_{i+n},$

where

(7) $b_1 = -a_1, \quad b_{n+1} = a_n, \quad b_r = a_{r-1} - a_r, \quad r = 2, 3, \ldots, n;$

and since a maximum for the $y$'s at $y_j$ occurs when $\Delta y_{j-1} > 0$, $\Delta y_j < 0$, it follows
that the theoretic frequency therefor is $\theta'/360°$, where

(8) $\cos\theta' = \sum_{r=1}^{n} b_r b_{r+1} \Big/ \sum_{r=1}^{n+1} b_r^2.$

In a similar manner, by using second differences, we get the expected relative
frequency $\theta''/360°$ for inflexional points, in specified direction. Moreover,
$\theta \leq \theta' \leq \theta'' \leq \cdots \leq 180°$; since inflections must be at least as frequent as
maxima, etc.

If the foregoing formulas are applied to the identical "graduation" $y_j = x_i$,
then $\cos\theta = 0$, $\cos\theta' = -1/2$, $\cos\theta'' = -2/3$. In fact,

(9) $\cos\theta^{(t)} = -t/(t + 1).$


This follows from the fact that the $b$'s and similar coefficients are the binomial
coefficients; and

(10) $\sum_{r=0}^{t} {}_tC_r^2 = {}_{2t}C_t; \quad \sum_{r=0}^{t-1} {}_tC_r \cdot {}_tC_{r+1} = {}_{2t}C_{t-1}.$






Thus repeated differencing leads toward the perfect zig-zag. An extension of 
this feature will be taken up in the next section. 
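
Formula (9) and the identities (10) can be confirmed numerically. The sketch below differences the identical "graduation" repeatedly, using the sign convention of (7); exact rational arithmetic avoids rounding questions. The helper names are illustrative.

```python
from fractions import Fraction
from math import comb


def difference(coeffs):
    """Coefficients b of the first difference, as in (7):
    b_1 = -a_1, b_r = a_(r-1) - a_r, b_(n+1) = a_n."""
    padded = [0] + list(coeffs) + [0]
    return [padded[r] - padded[r + 1] for r in range(len(padded) - 1)]


def cos_theta(coeffs):
    """Ratio of adjacent products to squares, as in (5) and (8)."""
    num = sum(a * b for a, b in zip(coeffs, coeffs[1:]))
    den = sum(a * a for a in coeffs)
    return Fraction(num, den)


coeffs = [1]  # y_j = x_i
cosines = []
for t in range(5):
    cosines.append(cos_theta(coeffs))
    coeffs = difference(coeffs)

# The two binomial sums in (10), checked for t = 4.
t = 4
sum_squares = sum(comb(t, r) ** 2 for r in range(t + 1))
sum_products = sum(comb(t, r) * comb(t, r + 1) for r in range(t))
```

The list `cosines` comes out as 0, -1/2, -2/3, -3/4, -4/5, in agreement with (9).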











3. Repeated summing and differencing. To indicate the result of the summing
of $n$ consecutive numbers in a sequence, I shall use the notation $1^n$. And
the difference $\Delta y_i = y_{i+1} - y_i$ will be indicated by $-1, 0^{n-1}, 1$. Thus if $n = 3$,
$1^3$ and $-1, 0^2, 1$ will stand respectively for

(11) $y_i = x_{i-1} + x_i + x_{i+1}; \quad \Delta y_i = -x_{i-1} + 0x_i + 0x_{i+1} + x_{i+2}.$

If, now, $z_i = y_{i-1} + y_i + y_{i+1}$, then

(12) $z_i = x_{i-2} + 2x_{i-1} + 3x_i + 2x_{i+1} + x_{i+2}.$


Since $(n)$ is often used to indicate the operation of summing $n$ consecutive
numbers, we may write

(13) $(3)^2 = 1, 2, 3, 2, 1; \quad (n)^2 = 1, 2, \ldots, (n-1), n, (n-1), \ldots, 2, 1.$

Then, for $n > 2$,

(14) $\Delta(n)^2 = -1^n, 1^n; \quad \Delta^2(n)^2 = 1, 0^{n-1}, -2, 0^{n-1}, 1.$

And, since the operations of summing and differencing are commutative, we
are led to


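(15) $F_k = (-1)^k \Delta^k (n)^k = {}_kC_0, 0^{n-1}, -{}_kC_1, 0^{n-1}, {}_kC_2, 0^{n-1}, \ldots, (-1)^k\, {}_kC_k;$

as may be established by induction. For from the foregoing, it follows that

(16) $(-1)^k \Delta^k (n)^{k+1} = {}_kC_0^{\,n}, -{}_kC_1^{\,n}, \ldots, (-1)^k\, {}_kC_k^{\,n}.$

Then, since ${}_{k+1}C_r = {}_kC_r + {}_kC_{r-1}$, we conclude that

(17) $F_{k+1} = (-1)^{k+1} \Delta^{k+1} (n)^{k+1} = {}_{k+1}C_0, 0^{n-1}, -{}_{k+1}C_1, 0^{n-1}, \ldots, (-1)^{k+1}\, {}_{k+1}C_{k+1}.$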
If now $n \geq 2$, then from (5) and (15) we find that

(18) $\cos\theta = 0; \quad \theta/360° = 1/4.$

Thus, the expected frequency of the changes in sign of $\Delta^k(n)^k$ is the same as
that for the raw or ungraduated data. Moreover, if $n \geq 3$, (8) leads to
$\cos\theta' = -1/2$, found for the data. For, in this case, at least two zero coefficients
intervene between any two non-zero coefficients. And thus

(19) $\cos\theta' = -\sum_{r=0}^{k} {}_kC_r^2 \Big/ 2\sum_{r=0}^{k} {}_kC_r^2 = -1/2.$

In fact, the same factor cancels from numerator and denominator as we take
higher differences, if a sufficient number of zeros intervene. More explicitly
stated, the formula (9) found for the data is valid also for $\Delta^k(n)^k$, provided
$n \geq t + 2$.

To make this more concrete, it may be noted that cycle lengths corresponding
to $t = 0$, 1, 2, 3, and 4, are respectively


(20) 4, 3, 2.73, 2.60, 2.52. 


From (15), we see directly that an element of $\Delta^k(n)^k$ is correlated only with
certain other elements which are at distances from it which are multiples of $n$.
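
The structure (15) and the cycle lengths (20) can be reproduced by building the coefficients directly. This is a sketch only: the choice $n = 6$, $k = 4$ is arbitrary (any $n \geq t + 2$ would serve for the differences examined), and the helper names are illustrative.

```python
import math


def convolve(a, b):
    """Compose two linear processes given by coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def repeat(op, times):
    """Apply the process `op` the given number of times."""
    result = [1]
    for _ in range(times):
        result = convolve(result, op)
    return result


def cycle_length(coeffs):
    """360 degrees divided by theta, with cos(theta) from (5)."""
    num = sum(a * b for a, b in zip(coeffs, coeffs[1:]))
    den = sum(a * a for a in coeffs)
    return 2 * math.pi / math.acos(num / den)


n, k = 6, 4
base = convolve(repeat([1] * n, k), repeat([-1, 1], k))  # k summings, k differencings
nonzero = {pos: c for pos, c in enumerate(base) if c != 0}

lengths = []
coeffs = base
for t in range(5):  # t further differencings of the result
    lengths.append(round(cycle_length(coeffs), 2))
    coeffs = convolve(coeffs, [-1, 1])
```

The non-zero coefficients of `base` fall only at multiples of $n$ and are the signed binomial coefficients 1, -4, 6, -4, 1, while `lengths` reproduces the values in (20).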

Some of the foregoing results may be included in a theorem as follows: 
THEOREM: Given a sequence of independent chance variates, each subject to the 
normal distribution (3) with mean zero. Upon this material, let k summings or 
averagings by n be performed and k differencings, in any order. Then the resulting 
sequence has something of the same chaotic nature as the data. In particular for 
$n \geq 2$ the expected frequency of changes of sign is the same, viz., 1/4 for change
from minus to plus and 1/4 for change from plus to minus. Moreover, as $n$ is
increased from 2 to 3, 4, 5,---, the expected frequency of other characteristics 
becomes the same, maxima and minima, points of inflection, etc., in accordance 
with (9). 

But, suppose now that after $k + 1$ summings by $n$, only $k$ differencings are
performed. Is the resulting sequence almost chaotic? Hardly so. At least, it
can be shown that changes of sign in each direction no longer have an expected
frequency fixed at 1/4; but this expected frequency decreases as $n$ increases.
To show this, formula (5) is applied to (16); and setting in (10), $C = {}_{2k}C_k$,
$C' = {}_{2k}C_{k-1}$, it follows that

(21) $\cos\theta = [(n - 1)C - C']/nC = 1 - (2k + 1)/n(k + 1).$

Then $\cos\theta > 1 - 2/n$; and the cycle length for expected changes of sign in
definite direction is somewhat greater than that obtained by setting $\cos\theta =
1 - 2/n$. For values of $n$ not too small, we may write $\cos\theta = 1 - \theta^2/2$,
approximately; and then approximately


(22) cycle length for definite change of sign in $\Delta^k(n)^{k+1}$ is $\pi\sqrt{n}$.

If $n = 9$, this approximate length is 9.4, assuming $k$ fairly large, whereas the
more exact length is 9.2.
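
As a check on (21) and (22), the coefficients of $\Delta^k(n)^{k+1}$ can be generated and compared with the closed form. The sketch assumes the reading of (21) as $\cos\theta = 1 - (2k+1)/n(k+1)$; the helper names are illustrative.

```python
import math
from fractions import Fraction


def convolve(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def repeat(op, times):
    result = [1]
    for _ in range(times):
        result = convolve(result, op)
    return result


def cos_theta_direct(n, k):
    """cos(theta) from (5) for k + 1 summings by n and k differencings."""
    coeffs = convolve(repeat([1] * n, k + 1), repeat([-1, 1], k))
    num = sum(a * b for a, b in zip(coeffs, coeffs[1:]))
    den = sum(a * a for a in coeffs)
    return Fraction(num, den)


def cos_theta_formula(n, k):
    """Closed form (21)."""
    return 1 - Fraction(2 * k + 1, n * (k + 1))


agree = all(cos_theta_direct(n, k) == cos_theta_formula(n, k)
            for n in range(2, 8) for k in range(5))

n = 9
more_exact = 2 * math.pi / math.acos(1 - 2 / n)  # limit of (21) for large k
approximate = math.pi * math.sqrt(n)             # formula (22)
```

For $n = 9$ the two lengths round to 9.2 and 9.4, the figures given in the text.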
Consider now the result of summing $k + 2$ times, and then differencing only $k$
times. For this purpose, a few formulas for summing squares will be useful.
By the method of differences it can be shown that if $l = a + nh$, and

(23) $T = a^2/2 + (a + h)^2 + (a + 2h)^2 + \cdots + (a + (n-1)h)^2 + l^2/2,$

then

(24) $T = n(a^2 + al + l^2)/3 + (l - a)^2/6n.$

Suppose, now, that $a/n$ takes on the values $0, {}_kC_0, -{}_kC_1, \ldots, (-1)^k\,{}_kC_k$ in
succession, while $l/n$ takes on the values ${}_kC_0, -{}_kC_1, \ldots, (-1)^k\,{}_kC_k, 0$. Let $U$
be the sum of the $(k + 2)$ values of $T$ thus obtained. Then by (10),

(25) $U = n^3(2\,{}_{2k}C_k - {}_{2k}C_{k-1})/3 + n \sum_{r=0}^{k+1} {}_{k+1}C_r^2 \Big/ 6,$

(26) $U = \dfrac{n^3}{3} \cdot \dfrac{(k + 2)(2k)!}{k!\,(k + 1)!} + \dfrac{n}{6}\, {}_{2k+2}C_{k+1}.$


Now, by applying to (16) one more summation by $n$, there are formed $(k + 2)$
arithmetic progressions of $(n + 1)$ terms each, alternately increasing and
decreasing. The maximum and minimum terms at the juncture of the progressions
are to be split into two halves to apply (23). Then the sum of the squares of
these coefficients is given by (26). This forms a denominator for (5).

To obtain the numerator for (5) we note that from $ab = [a^2 + b^2 - (a - b)^2]/2$
it follows that if

(27) $V = a(a + h) + (a + h)(a + 2h) + \cdots + (a + (n-1)h)(a + nh);$

then, from (23),

(28) $V = T - nh^2/2 = T - (l - a)^2/2n.$


If now $W$ is the sum of such $V$'s, reference to the last terms of (24) and (26)
shows that

(29) $W = U - (n/2)\, {}_{2k+2}C_{k+1}.$

And hence, from (5),

(30) $\cos\theta = \dfrac{W}{U} = \dfrac{(k + 2)n^2 - 4k - 2}{(k + 2)n^2 + 2k + 1}.$

Then

(31) $\cos\theta > \dfrac{n^2 - 4}{n^2 + 2},$

but only slightly greater when $k$ is large. Again

(32) $\cos\theta > 1 - 6/n^2;$

but only slightly greater when $n$ is not small. In this case, $\cos\theta = 1 - \theta^2/2$,
approximately. And thus, approximately, for large $k$, and for $n$ not small

(33) cycle length for definite change of sign of $\Delta^k(n)^{k+2} = 1.81n$.

This gives for $n = 10$ a cycle length of 18.1; whereas, if $\cos\theta$ is taken as the
right member of (31), the cycle length is 18.2.
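
The closed form (30) can be tested in the same way as (21). The sketch below takes (30) to read $\cos\theta = [(k+2)n^2 - 4k - 2]/[(k+2)n^2 + 2k + 1]$, which is what (5) gives when applied to the coefficients of $\Delta^k(n)^{k+2}$; the helper names are illustrative.

```python
import math
from fractions import Fraction


def convolve(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def repeat(op, times):
    result = [1]
    for _ in range(times):
        result = convolve(result, op)
    return result


def cos_theta_direct(n, k):
    """cos(theta) from (5) for k + 2 summings by n and k differencings."""
    coeffs = convolve(repeat([1] * n, k + 2), repeat([-1, 1], k))
    num = sum(a * b for a, b in zip(coeffs, coeffs[1:]))
    den = sum(a * a for a in coeffs)
    return Fraction(num, den)


def cos_theta_formula(n, k):
    """Closed form (30), as read above."""
    return Fraction((k + 2) * n * n - 4 * k - 2, (k + 2) * n * n + 2 * k + 1)


agree = all(cos_theta_direct(n, k) == cos_theta_formula(n, k)
            for n in range(2, 7) for k in range(5))

n = 10
from_31 = 2 * math.pi / math.acos((n * n - 4) / (n * n + 2))  # right member of (31)
from_33 = 2 * math.pi / math.sqrt(12) * n                     # 1.81 n
```

For $n = 10$ the two cycle lengths round to 18.2 and 18.1 respectively, as stated.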
Thus, if a $(k + 2)$-fold summation or averaging of random data is followed
by only $k$ differencings, the resulting graduation or linear processing $z = \Delta^k(n)^{k+2}$
is decidedly not as chaotic as the data, as seen from (31) and (33). But further,
$\Delta z = \Delta^{k+1}(n)^{k+2}$; and thus from (22) the cycle length for the expected maxima
of $z$ is about $\pi\sqrt{n}$.

Now Slutzky [1, p. 109] distinguished conspicuous waves from inconsequential
"ripples." On this basis, the frequency of significant cyclical features for a
chance variable, such as $z$, would be less than the frequency of the maxima. It
is not so clear that the frequency of significant features of a chance variable
will be greater than that for changes of sign in definite direction. That turned
out to be true for graduated values such as discussed in my earlier paper
[3, p. 262]. If this be also valid for $z$, we would expect that conspicuous "waves"
of $\Delta^k(n)^{k+2}$ would have average length between $\pi\sqrt{n}$ and $1.81n$, except for small
values of $n$ and $k$.


4. Graduations or linear processes and their successive differences. If double
summation by $n$ is followed by a single differencing, the result, as indicated in
(14), is, for $n = 3$,

(34) $y_i = -x_i - x_{i+1} - x_{i+2} + x_{i+3} + x_{i+4} + x_{i+5}.$

Then

(35) $y_{i+3} = -x_{i+3} - x_{i+4} - x_{i+5} + x_{i+6} + x_{i+7} + x_{i+8}.$


Thus $y_i$ and $y_{i+3}$ are negatively correlated, since $x_{i+3}$, $x_{i+4}$, and $x_{i+5}$ appear
in each, but with sign changed. This would seem to tend to make maxima
alternate with minima at distances of about 3; or at distances of $n$, in the general
case (14). Here, following Slutzky and Romanovsky, the coefficient of correlation
$r_p$ between elements at a distance of $p$ is taken as

(36) $r_p = E(x_i x_{i+p})/E(x_i^2).$


Using computed averages, instead of expected values, Alter [6] recommends 
a “correlation periodogram,” in which r, is the ordinate for abscissa p. 
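
For a linear process applied to independent zero-mean data of constant variance, the expected values in (36) reduce to sums of products of coefficients, so a theoretical correlation periodogram can be written down directly. The sketch below does this for the process (34); it is an illustration under those assumptions, with illustrative names, and is not Alter's computed-average procedure.

```python
from fractions import Fraction

# Coefficients of (34): double summation by 3 followed by one differencing.
coeffs = [-1, -1, -1, 1, 1, 1]


def correlation(coeffs, p):
    """Theoretical r_p between processed values p steps apart."""
    num = sum(coeffs[r] * coeffs[r + p] for r in range(len(coeffs) - p))
    den = sum(c * c for c in coeffs)
    return Fraction(num, den)


periodogram = {p: correlation(coeffs, p) for p in range(1, len(coeffs))}
```

The ordinates are 1/2, 0, -1/2, -1/3, -1/6 for $p = 1, \ldots, 5$; the most negative value falls at $p = 3$, the distance at which (34) and (35) share terms with opposite signs.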

Moreover, we would expect a graduation (4) with coefficients $a_i$ proportional
to the ordinates $y$ of the sinusoid $y = \sin(a + 2\pi x/p)$ taken for $x = 1, 2, 3, \ldots$
to impress upon random data oscillations with maxima separated from minima
by about $p/2$. But such $a_i$, as well as those in (34), have abrupt endings which
introduce noticeable alterations. More satisfactory results come from tapering
ends, such as appear in damped vibration, with coefficients about proportional
to $e^{-c|x|} \cos 2\pi x/p$ or to $e^{-c|x|} \sin 2\pi x/p$. H. Labrouste and Mrs. Labrouste [7]
give a powerful operator of this description.

Slutzky (loc. cit. pp. 119-123), Yule [8], and Walker [9] make use of damped 
harmonic vibration to explain the creation of cycles; while Bartels [10] ap- 
proaches by a different method the oscillations that do not last. 

Now the common graduation formulas have coefficients not conforming strictly
to damped vibration, as the tapering ends vibrate more quickly. However,
these ends have little more than a smoothing or stabilizing effect. Furthermore,
the coefficients for first differences are likely to conform to something like
$e^{-c|x|} \sin 2\pi x/p$. Some experimental evidence will be presented for the following
conclusion:

If the coefficients $a_i$ of a graduation or linear process (4) appear to conform
roughly to equidistant ordinates of a damped vibration, $\pm e^{-c|x|} \cos 2\pi x/p$ or
$\pm e^{-c|x|} \sin 2\pi x/p$, with changes of sign at intervals of $p/2$, then when this process
(4) is applied to independent chance data having zero mean and constant variance,
there is a tendency for the graduated or processed values to change sign at intervals
of about $p/2$.

A number of standard graduations have first and second differences, see (6),
(7), which bear a decided resemblance to damped vibrations, while the third or
fourth differences have only moderate, if any, cyclic appearance. This is
especially true of those graduations which are constructed by applying three
summings (the number of terms in a sum being in general different) and a fourth
process with negative coefficients. This is, indeed, a favorite form of graduation,
with which are associated the names of Woolhouse, Spencer, Higham,
Kenchington, Henderson, etc. The Spencer 21-term formula, for which some
features have already been described [3, p. 262], will now be examined, with
special reference to its differences. Cycle length for change of sign is one-half
that for change from minus to plus.


TABLE I

Coefficients (×350) for Spencer 21-term graduation and for first four differences.
Also theoretical cycle lengths for change in sign in values obtained from
random data

Series | Coefficients (×350), in order | Cycle Length
Graduation | -1, -3, -5, -5, -2, 6, 18, 33, 47, 57, 60, 57, 47, 33, 18, 6, -2, -5, -5, -3, -1 | 10.7
First differences | 1, 2, 2, 0, -3, -8, -12, -15, -14, -10, -3, 3, 10, 14, 15, 12, 8, 3, 0, -2, -2, -1 | 7.0
Second differences | -1, -1, 0, 2, 3, 5, 4, 3, -1, -4, -7, -6, -7, -4, -1, 3, 4, 5, 3, 2, 0, -1, -1 | 5.5
Third differences | 1, 0, -1, -2, -1, -2, 1, 1, 4, 3, 3, -1, 1, -3, -3, -4, -1, -1, 2, 1, 2, 1, 0, -1 | 3.2
Fourth differences | -1, 1, 1, 1, -1, 1, -3, 0, -3, 1, 0, 4, -2, 4, 0, 1, -3, 0, -3, 1, -1, 1, 1, 1, -1 | 1.6
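
The entries of Table I can be regenerated from the Spencer coefficients alone. The sketch below assumes the standard Spencer 21-term weights (×350), takes differences with the sign convention of (7), and halves the cycle length from (5) to obtain the length for a change of sign in either direction; the names are illustrative.

```python
import math

# Spencer 21-term graduation coefficients, multiplied by 350.
SPENCER_21 = [-1, -3, -5, -5, -2, 6, 18, 33, 47, 57, 60,
              57, 47, 33, 18, 6, -2, -5, -5, -3, -1]


def difference(coeffs):
    """First-difference coefficients, as in (7)."""
    padded = [0] + list(coeffs) + [0]
    return [padded[r] - padded[r + 1] for r in range(len(padded) - 1)]


def sign_change_cycle_length(coeffs):
    """Half of 360/theta, with cos(theta) from (5)."""
    num = sum(a * b for a, b in zip(coeffs, coeffs[1:]))
    den = sum(a * a for a in coeffs)
    return math.pi / math.acos(num / den)


rows = []
coeffs = SPENCER_21
for order in range(5):  # graduation, then first to fourth differences
    rows.append((order, coeffs, sign_change_cycle_length(coeffs)))
    coeffs = difference(coeffs)

cycle_lengths = [round(length, 1) for _, _, length in rows]
```

The same loop, run on any other graduation formula, gives the corresponding table for that formula.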

In the graduation formula itself, there are 11 positive coefficients, centrally
located, and relatively large as compared with the negative coefficients. This
11 is close to 10.7, the theoretical cycle length for changes of sign of $y_i - 4.5$,
the difference between the graduated value $y_i$ and its mean, the arithmetic
mean of 0, 1, 2, ..., 9. The structure of the first and second differences also
matches closely the corresponding cycle lengths. In the third differences, there
is a break at the center; but still there appears considerable regularity. But
among fourth differences, the zigzag is the prominent feature. Now the theorem
of Section 3 does not really apply to the Spencer formula, with its two summations
by fives and one summation by sevens, and another process. But it is not
surprising that the cyclicity ceases after passing the third differences.

As a basis for comparing observed values with expected values, the tenth
digits in the 600 logarithms from log 200 to log 799 were taken as a random set
of numbers. These 600 numbers had been given a Spencer 21-term graduation
[3, pp. 261-262], yielding 580 graduated values. From these the 579 first differences
were found, the 578 second differences, etc. These numbers, 580, 579, ...,
were multiplied respectively by the expected relative frequencies of change in
sign of $y_i - 4.5$, of $\Delta y_i$, $\Delta^2 y_i$, etc., as found by use of (5), (8), and similar
expressions to form the following table.

TABLE II

Comparison of expected changes of sign with observed changes for a Spencer 21-term
graduation

Series | Expected Number of Changes from - to + | Observed Number of Changes from - to +
Graduated values - 4.5 | 27.2 |
First differences | 41.3 |
Second differences | 52.9 |
Third differences | 90.4 |
Fourth differences | 176.7 |


The most abrupt change in frequency or cycle length appears to occur in
passing from third to fourth differences. In Table I, this is seen in the configuration
of positive and negative terms, and in the drop from 3.2 to 1.6 in cycle
length; and in Table II in the corresponding increase in expected sign changes
from 90.4 to 176.7. More spectacular is the increase in the number of zigzags
represented by -, +, -, +. Among the third differences, there were
found only 13 instances of four successive terms with signs as just indicated,
whereas among fourth differences there were found 75 such instances. For
random material, about 36 such zigzags would be expected: decidedly more than
found among the third differences, and decidedly less than found among the
fourth differences.

The Spencer 21-term graduation appears to be fairly representative of com- 
monly used graduations as regards regularity or irregularity in the distribution 
of positive and negative coefficients among the differences. For graduations 
with a much larger number of terms, the alternation of sign in fourth differ- 
ences may not be so rapid, as, e.g. in the 35-term 5th degree parabolic gradua- 
tion which Macaulay [11] calls No. 18. On the other hand, for a formula with 
non-tapering ends, such as the 13-term formula which Macaulay gives [11,
p. 64], the coefficients appearing in the differences are more irregular, especially
at the ends. While the Spencer formula is fairly representative, different for- 
mulas have distinguishing features. If it is desirable to form an idea of what a 
given formula will do to random data, a table like Table I can be constructed. 


5. Summary. When upon independent chance data, summing, averaging or
some more general graduation process is used, the graduated values tend to 
assume a wavy configuration. These waves often seem to have a fair amount 
of regularity or cyclicity. The first differences usually, and often other differ- 
ences of the graduated values, are decidedly cyclic. But, as we go in turn to 
the higher differences, the cyclicity may weaken. Indeed there may be a return 
to something like randomness. And subsequent differencings may tend to set 
up zigzags. 

If $(k + 2)$ successive summings by $n$ have been performed on independent
chance data, with $n$ not too small, say $n \geq 5$, then $k + 2$ differencings will
just about bring back the original chaotic or random condition. But with only 
k or (k + 1) differencings, a definite cyclicity remains, at least theoretically, in 
the expected values. 


In the case of the Spencer 21-term graduation, the coefficients for the suc- 
cessive differences indicate the appearance of cyclicity in first, second, and third 
differences. 
ON THE DISTRIBUTION OF WILKS' STATISTIC FOR TESTING THE
INDEPENDENCE OF SEVERAL GROUPS OF VARIATES 


By A. Wald¹ and R. J. Brookner¹
Columbia University 


1. Introduction. We consider $p$ variates $x_1, x_2, \ldots, x_p$ which have a joint
normal distribution. Let the variates be divided into $k$ groups; group one
containing $x_1, x_2, \ldots, x_{p_1}$, group two containing $x_{p_1+1}, x_{p_1+2}, \ldots, x_{p_2}$, etc. We
are interested in testing the hypothesis that the set of all population correlation
coefficients between any two variates which belong to different groups is zero.

Wilks² has derived, by using the Neyman-Pearson likelihood ratio criterion, a
statistic based on $N$ independent observations on each variate with which one
may test this hypothesis. Let $\|r_{ij}\|$ be the matrix of sample correlation
coefficients; Wilks' statistic, $\lambda$, is the ratio of the determinant of the $p$-rowed
matrix of sample correlations to the product of the $p_1$-rowed determinant of
correlations of the variates of group one, the $(p_2 - p_1)$-rowed determinant of
correlations of the second group, etc. That is

$\lambda = \dfrac{|r_{ij}|}{|r_{\alpha_1\beta_1}| \cdot |r_{\alpha_2\beta_2}| \cdots |r_{\alpha_k\beta_k}|}$

where $|r_{\alpha_i\beta_i}|$ is the principal minor of $|r_{ij}|$ corresponding to the $i$th group.

In order to use the test, the distribution function of $\lambda$ must be known. Wilks
has shown that in certain cases the exact distribution is a simple elementary 
function; in other cases it is an elementary function, but one which is rather 
unwieldy and which does not lend itself readily to practical use. It is our 
purpose in this paper (1) to show a method by which the exact distribution can 
be explicitly given as an elementary function for a certain class of groupings of 
the variates, and (2) to give an expansion of the exact cumulative distribution 
function in an infinite series which is applicable to any grouping. 


2. The exact distribution of $\lambda$. By the method to be described, the exact
distribution of $\lambda$ can be found when the numbers of variates in the groups are
such that there are an odd number in at most one group. If the number of 
variates is small, say at most eight, the method will increase only slightly the 
list of distribution functions that Wilks gives in his paper. 


¹ Research under a grant-in-aid of the Carnegie Corporation of New York.

² S. S. Wilks, "On the independence of k sets of normally distributed statistical
variables," Econometrica, Vol. 3 (1935), pp. 309-326. Other references to Wilks in this paper
except where otherwise noted are to this publication.
For purposes of deriving the distribution of $\lambda$ we may assume that $E(x_u) = 0$,
$(u = 1, 2, \ldots, p)$; that there are $n = N - 1$ independent observations
$x_{u\alpha}$ $(\alpha = 1, 2, \ldots, n)$ on each variate $x_u$; and that the sample covariance
between $x_i$ and $x_j$ is given by $s_{ij} = \sum_{\alpha=1}^{n} x_{i\alpha} x_{j\alpha}/n$. We define $u'$ (a function of $u$)
to be the total number of variables in all the groups which precede the group in
which $x_u$ lies. The complete theory is independent of the ordering of the groups
and of the ordering of the variates within the groups; hence without loss of
generality, we may assume that if any group contains an odd number of variates,
it will be the last group, hence $u'$ is always an even integer.













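Wilks has shown that $\lambda$ is a product $\prod_{u=p_1+1}^{p} z_u$ where each $z_u$ is distributed
independently of the others, and that the distribution of $z_u$ is

(1) $\dfrac{z_u^{\frac12(n-u-1)}(1 - z_u)^{\frac12(u'-2)}}{B[\frac12(n - u + 1),\, u'/2]}\, dz_u.$

Now let $y_u = \log z_u$, then the characteristic function of $y_u$ is

$\phi_u(t) = \dfrac{1}{B[\frac12(n - u + 1),\, u'/2]} \int_0^1 e^{t \log z_u} z_u^{\frac12(n-u-1)}(1 - z_u)^{\frac12(u'-2)}\, dz_u$

$\qquad = \dfrac{1}{B[\frac12(n - u + 1),\, u'/2]} \int_0^1 z_u^{\frac12(n-u-1)+t}(1 - z_u)^{\frac12(u'-2)}\, dz_u,$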


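where $t$ is a pure imaginary. It is known³ that this integral, even with complex
exponents, is the Beta-function so long as the real parts of both exponents are
greater than minus one, so

(2) $\phi_u(t) = \dfrac{B[\frac12(n - u + 1) + t,\, u'/2]}{B[\frac12(n - u + 1),\, u'/2]} = \dfrac{\Gamma[\frac12(n - u + 1) + t] \cdot \Gamma[\frac12(n - u + 1 + u')]}{\Gamma[\frac12(n - u + 1 + u') + t] \cdot \Gamma[\frac12(n - u + 1)]}.$

But here $u'$ is always an even integer, hence by the well known recursion formula
of the Gamma-function, which is valid for complex arguments excluding only
negative integers,

$\phi_u(t) = \dfrac{c_{un}}{[\frac12(n - u + 1) + t][\frac12(n - u + 3) + t] \cdots [\frac12(n - u + u' - 1) + t]},$

where

$c_{un} = [\tfrac12(n - u + 1)][\tfrac12(n - u + 3)] \cdots [\tfrac12(n - u + u' - 1)].$


³ See Whittaker and Watson, A Course of Modern Analysis, Fourth edition 1927, Chap. 12.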


* See Whittaker and Watson, A Course in Modern Analysis, Fourth edition 1927, Chap. 12. 
Now set

$y = \log \lambda = y_{p_1+1} + y_{p_1+2} + \cdots + y_p$

and the characteristic function of $y$ is

$\phi(t) = \prod_{u=p_1+1}^{p} \dfrac{c_{un}}{[\frac12(n - u + 1) + t][\frac12(n - u + 3) + t] \cdots [\frac12(n - u + u' - 1) + t]}.$

From the characteristic function, we can obtain the distribution function,
$g(y)$, of $y$ by the relation

$g(y) = \dfrac{c_n}{2\pi i} \int_{-i\infty}^{i\infty} \dfrac{e^{-ty}\, dt}{\prod_{u=p_1+1}^{p} [\frac12(n - u + 1) + t] \cdots [\frac12(n - u + u' - 1) + t]}$

where

$c_n = \prod_{u=p_1+1}^{p} c_{un}.$

The integration can be carried out by the method of residues; since $y$ is always
negative (the range of $\lambda$ is from 0 to 1), on a half circle with center at the origin
in the negative half of the complex $t$-plane, the integral of the function $\Phi(t)$
(the integrand above) converges to zero as the radius of the circle becomes infinite.
Since $\Phi(t)$ is analytic except for a finite number of poles on the negative real axis,
$g(y)$ is $c_n$ times the sum of the residues at these points.


Now $\Phi(t)$ is of the form $e^{-ty}/P(t)$ where $P(t)$ is a polynomial in $t$ as follows:
suppose that the groups contain $r_1, r_2, \ldots, r_k$ variables respectively, then let
$(k_j + 1)$ be the number of these $r$'s which are greater than or equal to $j$; then


P(t) = [}(n — 2) + t[4(m — 8) + t)*[3(n — 4) + eh [R(n — 5) + ae 
[3(n — 6) + eftettet* . 2. [R(n — p+ 1) + tft tte et tebe e-891 
where 


_ o/2 if o is even 
o/h = «NOR owed. 
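Then

$g(y; r_1, r_2, \ldots, r_k) = c_n \sum_{a=1}^{p-2} \dfrac{1}{\sigma_a!} \left[ \dfrac{d^{\sigma_a}}{dt^{\sigma_a}} \left\{ \left(t + \tfrac12(n - a - 1)\right)^{\sigma_a + 1} \Phi(t) \right\} \right]_{t = -\frac12(n-a-1)}$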


where 


Oa + 1 = ka + haz + +++ + kpycat2))-14a-) - 
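It can be shown that $\sigma_a$ is $\geq 0$ for $a$ between 1 and $p - 2$. Thus we have
$g(y; r_1, r_2, \ldots, r_k)$ and from it we can calculate $f(\lambda; r_1, r_2, \ldots, r_k)$.

Suppose $p = 8$ and that the variables are divided into two groups of four each,
then we will calculate the distribution function $f(\lambda; 4, 4)$.

$g(y; 4, 4) = \dfrac{c_n}{2\pi i} \int_{-i\infty}^{i\infty} \dfrac{e^{-ty}\, dt}{[\frac12(n-2)+t][\frac12(n-3)+t][\frac12(n-4)+t]^2 [\frac12(n-5)+t]^2 [\frac12(n-6)+t][\frac12(n-7)+t]}$

and

$c_n = \tfrac{1}{256}(n-2)(n-3)(n-4)^2(n-5)^2(n-6)(n-7).$

Then

$g(y; 4, 4) = 16c_n \left[ \dfrac{e^{\frac12(n-7)y}}{90} - \dfrac{e^{\frac12(n-6)y}}{6} - \left(\dfrac{8}{9} + \dfrac{y}{3}\right) e^{\frac12(n-5)y} + \left(\dfrac{8}{9} - \dfrac{y}{3}\right) e^{\frac12(n-4)y} + \dfrac{e^{\frac12(n-3)y}}{6} - \dfrac{e^{\frac12(n-2)y}}{90} \right].$

Since

$y = \log \lambda,$

we have

$f(\lambda; 4, 4) = 16c_n \left[ \dfrac{\lambda^{\frac12(n-9)}}{90} - \dfrac{\lambda^{\frac12(n-8)}}{6} - \dfrac{8\lambda^{\frac12(n-7)}}{9} + \dfrac{8\lambda^{\frac12(n-6)}}{9} + \dfrac{\lambda^{\frac12(n-5)}}{6} - \dfrac{\lambda^{\frac12(n-4)}}{90} - \dfrac13\left(\lambda^{\frac12(n-7)} + \lambda^{\frac12(n-6)}\right) \log \lambda \right].$

The cumulative distribution function is given by

$\text{Prob}\,[\lambda < w; 4, 4] = \int_0^w f(\lambda; 4, 4)\, d\lambda$

$\quad = \dfrac{16c_n}{3} \left[ \dfrac{w^{\frac12(n-7)}}{15(n-7)} - \dfrac{w^{\frac12(n-6)}}{n-6} - \dfrac{4(4n-23)\,w^{\frac12(n-5)}}{3(n-5)^2} + \dfrac{4(4n-13)\,w^{\frac12(n-4)}}{3(n-4)^2} + \dfrac{w^{\frac12(n-3)}}{n-3} - \dfrac{w^{\frac12(n-2)}}{15(n-2)} - \left( \dfrac{2w^{\frac12(n-5)}}{n-5} + \dfrac{2w^{\frac12(n-4)}}{n-4} \right) \log w \right].$

Wilks' expression for the cumulative distribution function appears to be quite
different, but if we substitute $n = N - 1$ and use the relation

$B_{\sqrt{w}}(N - 6, 4) = \dfrac{\Gamma(N - 2)}{\Gamma(N - 6) \cdot \Gamma(4)} \int_0^{\sqrt{w}} x^{N-7}(1 - x)^3\, dx$

$\quad = \tfrac16 (n-2)(n-3)(n-4)(n-5) \left\{ \dfrac{w^{\frac12(n-5)}}{n-5} - \dfrac{3w^{\frac12(n-4)}}{n-4} + \dfrac{3w^{\frac12(n-3)}}{n-3} - \dfrac{w^{\frac12(n-2)}}{n-2} \right\}$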
n—2 
it can be shown that the two formulas for the cumulative distribution are
identical.

In cases where $u'$ is not always an even integer, the exact distribution function
of $\lambda$ can still be obtained using this method. However, in such a case, the
gamma functions do not cancel out and the integrand has an infinitude of
poles, so the function is expressed by an infinite series. We will use a different
method to obtain an infinite series expansion.
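
One simple consistency check on the cumulative distribution for two groups of four is that it must equal 1 at $w = 1$, where the logarithmic term vanishes. The sketch below assumes the form of the bracket obtained by evaluating the residues of the integrand for $g(y; 4, 4)$, with $c_n = (n-2)(n-3)(n-4)^2(n-5)^2(n-6)(n-7)/256$; the function names are illustrative, and the check does not by itself confirm every coefficient.

```python
from fractions import Fraction


def c_n(n):
    """Normalizing constant for p = 8, two groups of four (assumed form)."""
    return Fraction((n - 2) * (n - 3) * (n - 4) ** 2 * (n - 5) ** 2
                    * (n - 6) * (n - 7), 256)


def cdf_at_one(n):
    """Prob[lambda < 1; 4, 4]; the term in log w drops out at w = 1."""
    bracket = (Fraction(1, 15 * (n - 7))
               - Fraction(1, n - 6)
               - Fraction(4 * (4 * n - 23), 3 * (n - 5) ** 2)
               + Fraction(4 * (4 * n - 13), 3 * (n - 4) ** 2)
               + Fraction(1, n - 3)
               - Fraction(1, 15 * (n - 2)))
    return Fraction(16, 3) * c_n(n) * bracket


totals = [cdf_at_one(n) for n in range(8, 30)]
```

Exact rational arithmetic is used so that the total probability can be compared with 1 without rounding error; every entry of `totals` is exactly 1 for $n = 8, \ldots, 29$.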


3. A series expansion of the cumulative distribution function. Let us put
$v = -y$, and let the density function of $v$ be $h(v)$; then from (2) we have

$$h(v) = \frac{c_n}{2\pi i}\int_{-i\infty}^{i\infty} e^{tv} \prod_{u=p_1+1}^{p} \frac{\Gamma[\frac12(n-u+1)+t]}{\Gamma[\frac12(n-u+1+u')+t]}\,dt.$$

Since $v$ is a monotonic decreasing function of $\lambda$, and since the critical region for
testing the null hypothesis is given by the inequality $\lambda < \lambda_0$, the critical
region will be defined by $v > v_0$, where $v_0$ is such that

$$\int_{v_0}^{\infty} h(v)\,dv$$

is equal to a chosen level of significance.
PROPOSITION 1.

$$h(v) = h_n(v)\,\Psi(v),$$

where $\Psi(v)$ does not depend on $n$, and $h_n(v) = c_n e^{-\frac12 nv}$.
PROOF: Let

$$t' = t + \tfrac12(n - p).$$

Then

$$h(v) = \frac{c_n}{2\pi i}\int_{-i\infty+\frac12(n-p)}^{i\infty+\frac12(n-p)} e^{v(t'-\frac12(n-p))} \prod_{u=p_1+1}^{p} \frac{\Gamma[\frac12(p-u+1)+t']}{\Gamma[\frac12(p-u+u'+1)+t']}\,dt'.$$

Now the area in the complex plane bounded by the vertical line through $\frac12(n-p)$,
by the vertical line through the origin, and by arcs of a circle with center at the
origin of arbitrary radius is one in which the integrand is everywhere regular.
Furthermore, the integral along the arcs approaches zero as the radius of the
circle approaches infinity; hence the integrals along the vertical line through
$\frac12(n-p)$ and along the vertical axis are equal. Then we may write

$$\frac{h(v)}{c_n e^{-\frac12 nv}} = \frac{e^{\frac12 pv}}{2\pi i}\int_{-i\infty}^{i\infty} e^{vt'} \prod_{u=p_1+1}^{p} \frac{\Gamma[\frac12(p-u+1)+t']}{\Gamma[\frac12(p-u+u'+1)+t']}\,dt' = \Psi(v).$$

Therefore

$$h(v) = c_n e^{-\frac12 nv}\,\Psi(v).$$
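PROPOSITION 2.

$$I = \lim_{n\to\infty} \int_0^\infty \frac{c_n e^{-\frac12 nv}\, v^{r-1}}{\Gamma(r)}\,dv = 1,$$

where we define

$$r = \sum_{i=1}^{k}\sum_{j=i+1}^{k} \frac{r_i r_j}{2} = \tfrac12\,[r_1 r_2 + r_3(r_1 + r_2) + \cdots + r_k(r_1 + r_2 + \cdots + r_{k-1})] = \tfrac12 \sum_{u=p_1+1}^{p} u'.$$

PROOF: Let

$$v^* = \tfrac12 nv;$$

then

$$\int_0^\infty c_n e^{-\frac12 nv} v^{r-1}\,dv = \int_0^\infty c_n e^{-v^*} \left(\frac{2}{n}\right)^{r} (v^*)^{r-1}\,dv^* = c_n \left(\frac{2}{n}\right)^{r} \Gamma(r).$$

Now

$$c_n \left(\frac{2}{n}\right)^{r} = \prod_{u=p_1+1}^{p} \left(\frac{2}{n}\right)^{u'/2} \frac{\Gamma[\frac12(n-u+1+u')]}{\Gamma[\frac12(n-u+1)]},$$

and therefore

$$\lim_{n\to\infty} \left(\frac{2}{n}\right)^{u'/2} \frac{\Gamma[\frac12(n-u+1+u')]}{\Gamma[\frac12(n-u+1)]} = 1$$

by an application of the Stirling approximation. Therefore
$I = \prod 1 = 1$.
We then write

$$\psi(v) = \frac{\Psi(v)\,\Gamma(r)}{v^{r-1}},$$

hence

(3) $$h(v) = \frac{c_n e^{-\frac12 nv}\, v^{r-1}\, \psi(v)}{\Gamma(r)}.$$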
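The limit in Proposition 2 can be checked numerically. The sketch below computes $K = (2/n)^r c_n$ from the gamma-function product, under the assumption (not restated in this section) that $u'$ is the number of variates in the groups preceding the group that contains variate $u$; the function name `k_constant` is hypothetical.

```python
import math

def k_constant(groups, n):
    """Return (K, r) with K = (2/n)^r * c_n for a grouping such as (1, 1, 1).

    Assumption: u' is the number of variates in the groups that precede
    the group containing variate u, so that r = (1/2) * sum of u'.
    """
    log_c = 0.0
    sum_u_prime = 0
    preceding = 0  # variates in earlier groups
    u = 0          # running index of the variate
    for size in groups:
        for _ in range(size):
            u += 1
            u_prime = preceding
            if u_prime > 0:
                log_c += (math.lgamma(0.5 * (n - u + 1 + u_prime))
                          - math.lgamma(0.5 * (n - u + 1)))
                sum_u_prime += u_prime
        preceding += size
    r = sum_u_prime / 2.0
    return math.exp(log_c + r * math.log(2.0 / n)), r

if __name__ == "__main__":
    for n in (10, 11, 100, 10000):
        print(n, k_constant((1, 1, 1), n))
```

For the grouping (1, 1, 1) this gives roughly 0.738 at n = 10 and 0.761 at n = 11, which agrees with the first entries printed under the heading 111 in the tables of K, and the value approaches 1 as n grows.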


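PROPOSITION 3. For any positive integer $s$,

$$\lim_{n\to\infty} \left\{ n^{s}\, \mathrm{Prob}\!\left(v > \frac{1}{\sqrt{n}}\right) \right\} = 0.$$

PROOF: Since $v = -\log\lambda$, the inequality $v > 1/\sqrt{n}$ is equivalent to the
inequality $\lambda < e^{-1/\sqrt{n}}$. Since $\lambda = \prod_{u=p_1+1}^{p} z_u$, the inequality $\lambda < e^{-1/\sqrt{n}}$ implies that
there exists at least one value of $u$ for which

$$z_u < e^{-1/((p-p_1)\sqrt{n})}.$$

Hence

$$\sum_{u=p_1+1}^{p} P\!\left(z_u < e^{-1/((p-p_1)\sqrt{n})}\right) \ge P\!\left(\lambda < e^{-1/\sqrt{n}}\right) = P\!\left(v > 1/\sqrt{n}\right).$$

Hence in order to prove Proposition 3 we have only to show that for each $u$ and
any arbitrary positive integer $s$

$$\lim_{n\to\infty} \left\{ n^{s}\, P\!\left(z_u < e^{-1/((p-p_1)\sqrt{n})}\right) \right\} = 0.$$

From (1) we have

$$P\!\left(z_u < e^{-1/((p-p_1)\sqrt{n})}\right) = \frac{1}{B[\frac12(n-u+1);\, u'/2]} \int_0^{e^{-1/((p-p_1)\sqrt{n})}} z_u^{\frac12(n-u-1)} (1-z_u)^{u'/2-1}\,dz_u.$$

Over the range of integration, we have $z_u \le e^{-1/((p-p_1)\sqrt{n})}$, so

$$P\!\left(z_u < e^{-1/((p-p_1)\sqrt{n})}\right) \le \frac{e^{-\frac12(n-u-1)/((p-p_1)\sqrt{n})}}{B[\frac12(n-u+1);\, u'/2]} \int_0^{e^{-1/((p-p_1)\sqrt{n})}} (1-z_u)^{u'/2-1}\,dz_u$$

$$= \frac{e^{-\frac12(n-u-1)/((p-p_1)\sqrt{n})}}{B[\frac12(n-u+1);\, u'/2]} \left[-\frac{2}{u'}(1-z_u)^{u'/2}\right]_0^{e^{-1/((p-p_1)\sqrt{n})}}$$

$$= \frac{2\,e^{-\frac12(n-u-1)/((p-p_1)\sqrt{n})}}{u'\, B[\frac12(n-u+1);\, u'/2]} \left[1 - \left(1 - e^{-1/((p-p_1)\sqrt{n})}\right)^{u'/2}\right].$$

It follows from the Stirling formula that

$$\lim_{n\to\infty} \left(\frac{n}{2}\right)^{u'/2} B[\tfrac12(n-u+1);\, u'/2] = \lim_{n\to\infty} \left(\frac{n}{2}\right)^{u'/2} \frac{\Gamma[\frac12(n-u+1)]\,\Gamma(u'/2)}{\Gamma[\frac12(n-u+1+u')]} = \Gamma(u'/2).$$

Since

$$\lim_{n\to\infty} n^{s+u'/2}\, e^{-\sqrt{n}/(2(p-p_1))} = 0$$

and

$$\lim_{n\to\infty} \left[1 - \left(1 - e^{-1/((p-p_1)\sqrt{n})}\right)^{u'/2}\right] = 1,$$

the proposition follows.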
PROPOSITION 4. The function $\psi(v)$ of formula (3) can be expanded in a power
series, i.e.

$$\psi(v) = \alpha_0 + \alpha_1 v + \alpha_2 v^2 + \cdots$$

with a finite radius of convergence.
PROOF: Wilks⁴ has considered the following integral equation:


. 
bi + t)-T(b, + e« -» T(by + @) 
t d ‘a CB' r( 1 @ 

i “ane Ta + 8)-T( +t) --- Te + 8)’ 

= F(ci)-T(c) --- Mca) 

where © = TGi)-T a) «= Fy)’ 

(¢ = 1,2,---,q). Wilks has shown that the solution of the integral equation, 
g(w), is given by the following expression: 


kwee “7 <7 


B and g(w) are independent of ¢, and b; < ¢; 


g(w) = 


5) 
[ [-. . [i ¢;—b;—1 ae Cqg—1—bg—1-—-1 
ee Vq—-1 


x My - sapesionty — yt Pet. (Ys) 
x E - a(a “ yy: — {1 + n(1 — »)} ¢ d 7) a 
x E — {yn + (1 —) +--- 


+ teal = w)(1 =m) «++ (heed }(1 - BY 


X dv; doz +++ dvg1 


j=0 


⁴ S. S. Wilks, "Certain generalizations in the analysis of variance," Biometrika, Vol. 24
(1932), pp. 474-5.
the range of $w$ being $0 \le w \le B$. Wilks has furthermore shown that

(5) $$\{v_1 + v_2(1 - v_1) + \cdots + v_{q-1}(1 - v_1)(1 - v_2)\cdots(1 - v_{q-2})\}\left(1 - \frac{w}{B}\right) < 1$$

for $w > 0$ and $0 \le v_i \le 1$ $(i = 1, 2, \cdots, q-1)$.
We denote the left hand side of (5) by $\zeta_i$. The factor $(1 - \zeta_i)^{b_i - c_{i+1}}$ can be
expanded in a power series, i.e.

(6) $$(1 - \zeta_i)^{b_i - c_{i+1}} = (1 - \zeta_i)^{-(c_{i+1} - b_i)} = 1 + (c_{i+1} - b_i)\zeta_i + \tfrac12 (c_{i+1} - b_i)(c_{i+1} - b_i + 1)\zeta_i^2 + \cdots$$

with a radius of convergence equal to one. Since we will show shortly that for
the choices we make for the $b_i$'s and $c_i$'s, $c_{i+1} \ge b_i$, all coefficients in this
last expansion are non-negative. Substituting this series expansion (6) in (4),
and ordering it according to powers of $(1 - w/B)$, the expression under the integral
sign (in 4) becomes

(7) $$\theta_0(v_1, v_2, \cdots, v_{q-1}) + \theta_1(v_1, \cdots, v_{q-1})\left(1 - \frac{w}{B}\right) + \theta_2(v_1, \cdots, v_{q-1})\left(1 - \frac{w}{B}\right)^2 + \cdots .$$

This series is uniformly convergent over the domain defined by the inequalities
$0 \le v_i \le 1$ $(i = 1, 2, \cdots, q-1)$ and $|1 - w/B| < 1$. We can even say that
(7) is uniformly convergent for $|1 - w/B| < 1$ if we substitute for each $\theta_i$
the maximum of $\theta_i$ with respect to $v_1, v_2, \cdots, v_{q-1}$. Hence we may integrate
the series (7) with respect to $v_1, v_2, \cdots, v_{q-1}$ term by term, i.e.

(8) $$\int\!\!\int\!\cdots\!\int \left[\theta_0 + \theta_1\left(1 - \frac{w}{B}\right) + \cdots\right] dv_1\, dv_2 \cdots dv_{q-1} = \sigma_0 + \sigma_1\left(1 - \frac{w}{B}\right) + \sigma_2\left(1 - \frac{w}{B}\right)^2 + \cdots$$

and the series (8) is uniformly convergent for $|1 - w/B| < 1$. The coefficients
$\sigma_0, \sigma_1, \cdots$ are non-negative.

The case of the $\lambda$ statistic which we are considering is a special case of this
integral equation which we obtain by making the following substitutions:

$$w = \lambda, \quad B = 1, \quad u = r + p_1, \quad q = p - p_1,$$
$$b_r = \tfrac12(n - u + 1), \quad c_r = \tfrac12(n - u + u' + 1), \quad (r = 1, 2, \cdots, p - p_1).$$

Note that then

$$c_{r+1} - b_r = \tfrac12[(u + 1)' - 1] \ge 0.$$

Hence, according to (4)

$$g(\lambda)\,d\lambda = K \lambda^{\frac12(n-p-1)} (1 - \lambda)^{r-1} \{\sigma_0 + \sigma_1(1 - \lambda) + \sigma_2(1 - \lambda)^2 + \cdots\}\,d\lambda$$

where the infinite series converges for $|1 - \lambda| < 1$.
Now $v = -\log\lambda$, or $\lambda = e^{-v}$, hence

$$h(v)\,dv = K e^{-\frac12(n-p+1)v}\, v^{r-1} \left(\frac{1 - e^{-v}}{v}\right)^{r-1} \{\epsilon_0 + \epsilon_1 v + \epsilon_2 v^2 + \cdots\}\,dv$$

where the series $\{\epsilon_0 + \epsilon_1 v + \epsilon_2 v^2 + \cdots\}$ is obtained from the series $\{\sigma_0 +
\sigma_1(1 - \lambda) + \cdots\}$ by substituting for $(1 - \lambda)$ the Taylor expansion of $(1 - e^{-v})$.
The series $\{\epsilon_0 + \epsilon_1 v + \epsilon_2 v^2 + \cdots\}$ has a finite radius of convergence.⁵

Hence the function $\psi(v)$ can be written as

$$\psi(v) = A\, e^{\frac12(p-1)v} \left(\frac{1 - e^{-v}}{v}\right)^{r-1} \{\epsilon_0 + \epsilon_1 v + \epsilon_2 v^2 + \cdots\}$$

where $A$ denotes a constant factor. Then since $e^{\frac12(p-1)v}\left(\frac{1 - e^{-v}}{v}\right)^{r-1}$ can be
expanded in a Taylor series around $v = 0$, Proposition 4 is proved.


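4. Evaluation of the coefficients in the expansion of $\psi(v)$. Let the series
expansion of $\psi(v)$ be

$$\psi(v) = \alpha_0 + \alpha_1 v + \alpha_2 v^2 + \cdots .$$

Then we have

$$\int_0^\infty \frac{c_n e^{-\frac12 nv}\, v^{r-1}}{\Gamma(r)} (\alpha_0 + \alpha_1 v + \alpha_2 v^2 + \cdots)\,dv = 1.$$

Now let $v^* = \frac12 nv$; then

$$\int_0^\infty \left(\frac{2}{n}\right)^{r} \frac{c_n e^{-v^*} (v^*)^{r-1}}{\Gamma(r)} \left(\alpha_0 + \frac{2\alpha_1}{n} v^* + \frac{4\alpha_2}{n^2} (v^*)^2 + \cdots\right) dv^* = 1.$$

Suppose that the asymptotic expansion of $\left(\frac{n}{2}\right)^{r} \frac{1}{c_n}$ is given by

$$\beta_0 + \frac{\beta_1}{n} + \frac{\beta_2}{n^2} + \cdots .$$

On account of Proposition 3, we have that the asymptotic expansion in powers
of $1/n$ of

(9) $$\int_0^{\sqrt{n}/2} \frac{e^{-v^*} (v^*)^{r-1}}{\Gamma(r)} \left(\alpha_0 + \frac{2\alpha_1}{n} v^* + \frac{4\alpha_2}{n^2} (v^*)^2 + \cdots\right) dv^*$$

must be equal to the asymptotic expansion of $\left(\frac{n}{2}\right)^{r} \frac{1}{c_n}$. Since we may integrate
in (9) term by term for sufficiently large $n$, we easily obtain

$$\alpha_0 = \beta_0, \qquad \alpha_k = \frac{\beta_k}{2^k\, r(r + 1) \cdots (r + k - 1)}.$$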
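The relation between the two coefficient sets, $\alpha_k = \beta_k / (2^k r(r+1)\cdots(r+k-1))$, is easy to mechanise. The following is a minimal sketch; the function name `alphas_from_betas` and the sample inputs are illustrative only and are not taken from the tables.

```python
def alphas_from_betas(betas, r):
    """Convert beta_0, beta_1, ... into alpha_0, alpha_1, ...

    Uses alpha_k = beta_k / (2^k * r (r+1) ... (r+k-1)), with the empty
    product for k = 0 taken as 1.
    """
    alphas = []
    rising = 1.0  # r (r+1) ... (r+k-1)
    for k, beta in enumerate(betas):
        if k > 0:
            rising *= (r + k - 1)
        alphas.append(beta / (2 ** k * rising))
    return alphas

if __name__ == "__main__":
    # Illustrative numbers only.
    print(alphas_from_betas([1.0, 6.0, 24.0], r=3))
```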


5 See A. Gutzmer, Theorie der Eindeutigen Analytischen Funktionen, 1906, pp. 91-2. 
The asymptotic expansion of $\left(\frac{n}{2}\right)^{r} \frac{1}{c_n}$ can be calculated in the following manner:


By Be 
(- + iy o tatat mtopt 
7 48s 8y.. 


Ca ie 


Equating the right hand members of these last two equations, and taking 
logs, we obtain 


B B = 
lor n+ P+ B+ s+ |= rlog (1 + 2/n) + E log ( — ) 


— Dog (1 -— “$= 1) 4 tog (56 + B+ B+ -. -), 


Then we expand each term in a series of powers of $1/n$ and equate coefficients
of $1/n^i$ for each $i$. We obtain the following formulae for the first five $\beta$'s:


fo = 1 
A=r+3> (u—1)?-2>) u—w' —1) 


2 
a= a+ -* pw -tDw-w- 
Bs = —$B: — Bi — 4361 + BiB, + 282 + 3r 
+X Dw-v-ALDw-w 


Bi 


261 + 261 + B+ F — 3B:B: + Bibs — BiBe — 46s 


Be a a -: = i: ae 
+5 + 36s s°+a2-6 1) wey u’ — 1)°, 


5. Practical use of the series. In practical applications, the value of the
statistic, say $\lambda_0$, is calculated, and it is desired to determine whether or
not this value of the statistic falls into the critical region. That is, for a particular
grouping of the variates, for a particular number of degrees of freedom, and
for a chosen level of significance $\alpha$, there is determined from the distribution of
$\lambda$ a value $\lambda^*$ such that

$$\mathrm{Prob}\,[\lambda < \lambda^*] = \alpha$$

and if $\lambda_0 < \lambda^*$ we reject the hypothesis that in the population from which the
sample is taken all the correlation coefficients between variates in different
groups are zero.

Since $v$ is a monotonic decreasing function of $\lambda$ we make the test by computing
$v_0 = -\log\lambda_0$ and we reject the hypothesis if $v_0 > v^*$ where $v^* = -\log\lambda^*$. But
this is equivalent to computing Prob $[v > v_0]$ and if this value is less than $\alpha$ we
reject the hypothesis. Now


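$$\mathrm{Prob}\,[v > v_0] = J_n(r_1, r_2, \cdots, r_k) = \frac{c_n}{\Gamma(r)} \int_{v_0}^{\infty} e^{-\frac12 nv}\, v^{r-1} (1 + \alpha_1 v + \alpha_2 v^2 + \cdots)\,dv.$$

Setting $z = \frac12 nv$,

$$\mathrm{Prob}\,[v > v_0] = \left(\frac{2}{n}\right)^{r} \frac{c_n}{\Gamma(r)} \int_{nv_0/2}^{\infty} e^{-z} z^{r-1} \left\{1 + \alpha_1 \frac{2}{n} z + \alpha_2 \left(\frac{2}{n}\right)^{2} z^2 + \cdots\right\} dz.$$

On account of Proposition 3 we obtain an asymptotic expansion of Prob $[v > v_0]$
by integrating the right hand member of the above equation term by term.
This can be expressed by means of the incomplete gamma function, which is
tabulated⁶ in the form

$$I(u, p) = \frac{1}{\Gamma(p + 1)} \int_0^{u\sqrt{p+1}} v^{p} e^{-v}\,dv.$$

We obtain

$$\mathrm{Prob}\,[v > v_0] = \left(\frac{2}{n}\right)^{r} c_n \left\{ \left[1 - I\!\left(\frac{nv_0}{2\sqrt{r}},\, r - 1\right)\right] + \frac{\beta_1}{n}\left[1 - I\!\left(\frac{nv_0}{2\sqrt{r+1}},\, r\right)\right] + \frac{\beta_2}{n^2}\left[1 - I\!\left(\frac{nv_0}{2\sqrt{r+2}},\, r + 1\right)\right] + \cdots \right\}.$$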
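The term-by-term integration described above can be sketched in code. The version below works with the regularized incomplete gamma function $P(a, x)$ directly instead of Pearson's tabulated $I(u, p)$ (the two are related by $I(u, p) = P(p + 1, u\sqrt{p+1})$), and it evaluates $P$ by its power series, which is an implementation choice of this sketch. The function names and the sample arguments are hypothetical; `betas[0]` is assumed to be $\beta_0 = 1$.

```python
import math

def reg_lower_gamma(a, x, max_terms=2000):
    """Regularized lower incomplete gamma P(a, x), evaluated by the series
    P(a, x) = e^{-x} x^a * sum_{k>=0} x^k / Gamma(a + k + 1)."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)
        total += term
        if term < 1e-16 * total:
            break
    return min(total, 1.0)

def tail_probability(v0, n, r, K, betas):
    """Approximate Prob[v > v0] = K * sum_k (beta_k / n^k) * Q(r + k, n v0 / 2),
    where Q = 1 - P and K = (2/n)^r c_n."""
    z0 = 0.5 * n * v0
    total = 0.0
    for k, beta in enumerate(betas):
        total += beta / n ** k * (1.0 - reg_lower_gamma(r + k, z0))
    return K * total

if __name__ == "__main__":
    # Illustrative inputs only; K and the betas would come from the tables.
    print(tail_probability(v0=0.5, n=20, r=1.5, K=0.866, betas=[1.0]))
```

Because the expansion is asymptotic in $1/n$, adding more $\beta$ terms is only expected to help when $n$ is reasonably large.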


The values of the constant $K = \left(\frac{2}{n}\right)^{r} c_n$ and the values of $\beta_1, \beta_2, \beta_3, \beta_4$ are
herein tabulated for any grouping which might be made on six or fewer variates.
Some cases, such as the groupings (1, p - 1), in which case the distribution of $\lambda$
is the distribution of the multiple correlation coefficient, and the groupings
(2, p - 2), the exact distribution for which was given by Wilks as an incomplete
Beta-function, are superfluous here. These cases are included only for the sake
of completeness.


⁶ K. Pearson (Editor), Tables of the Incomplete Gamma Function, Biometric Laboratory,
London, 1922.
Table of the First Four $\beta$'s


| Bo Bs Bs 
4 8 16 


~ 
— 


~~ ~~ 


oro 


or 


or 


2 
1 
3 
2 
2 
| 
4 
3 
3 
2 
2 
J 
5 
4 
3,: 
+ 
3 
2 
3 
2 
2 
1 


6.28125 
12.03125 
19 


23.53125 | 


28 .625 
28 
55 


62.53125 | 


77 


95 .625 


55.78125 | 


125 
154 .03125 


136.28125 | 
189.53125 | 


214 
203 .625 
229 .03125 
244 .625 
260 .78125 


86.03125 | 


13.38281 
36 .91406 
65 
83 .97656 
106 .9375 
120 
285 
334. 10156 
439 
506 . 16406 
580.6875 
315.82031 
910 
1205 .03906 
1015.50781 
1584.10156 
1866 
1740 .9375 
2042 . 16406 
2230. 1875 
2430 .49219 


27 .57568 
111.55225 
211 
279 .50538 
366 .39844 
496 
1351 
1615.91163 

2229 

2628 . 23974 
3085 . 52344 
1690 .65282 
5901 

8277 .55226 
6693 . 45068 
11445 .75538 
13947 

12797 .27344 
15530 .08351 
17257 .64836 
19139 .02892 
Tables of the Constant $K = \left(\frac{2}{n}\right)^{r} c_n$


111 
738 
761 
780 
796 
-810 
.822 
.833 
843 
.851 
.859 
.866 
.878 
.888 
.896 
.903 
-910 
.922 
-932 
-940 
-946 
-950 
-954 
.958 
.961 
- 966 
-970 
.973 


31 


-646 
-676 
. 702 
724 
743 
759 
774 
787 
-798 
.808 
.818 
-834 
.847 
-859 
.869 
-877 
.894 
-908 
-918 
-926 
-932 
-938 
-943 
947 
-953 
. 959 
. 963 


22 


.560 
.595 
-625 
.651 
.674 
693 
711 
727 
741 
754 
765 
785 
.802 
.817 
.829 
-840 
.862 
.879 
892 
-902 
911 
.918 
924 
-930 
-938 
945 
-951 


211 


.517 
.553 
585 
.612 
.637 
.658 
677 
.694 
.709 
723 
736 
.758 
177 
793 


.843 
.862 
877 


-907 
.914 
.920 
.930 
.937 


1111 


477 
.515 
.548 
.576 
.602 
.624 
.645 
663 
.679 
.694 
.708 
732 
752 
.770 


.825 
.846 
862 





WILKS’ STATISTIC 


Tables of the Constant K
221 2111 32 11111 51 
. 269 .248 .336 .229 .323 
.310 . 288 .379 . 268 .369 
.347 .325 .417 .304 .410 
.381 .359 .451 .338 .445 
.412 .390 .481 .368 .478 
.441 .418 .508 .397 .506 
.467 .444 .533 .423 .532 
.490 .468 .556 447 .555 

.490 .576 .470 .576 
.532 511 .595 .490 .596 
.551 .530 .612 .510 .613 
.584 .564 .642 ‘ .644 
-613 ‘ .668 ‘ -671 
.638 j .691 d .694 
.660 .642 711 ‘ .714 
.680 j .728 : .731 
.720 .704 . 764 ‘ . 767 
.751 ‘ .791 ‘ .794 
.776 . .813 ‘ .816 
.797 -785 .830 . .833 
.814 .803 .845 ‘ .848 
.828 .818 .857 j .860 
.841 .831 .868 , .870 
.852 ‘ .877 ; .879 
.869 , .892 d .894 
.883 .876 .903 j .905 
Tables of the Constant K


321 
108 
.140 
171 
.201 
.230 
257 
284 
.309 
332 
.354 
.375 


222 


.094 
.123 
.152 
.180 
.208 
.235 
.261 
.285 
.308 
.330 
.351 
.390 
424 | 
.456 
484 


3111 


-100 
.130 
.160 
.189 


2211 


-087 
114 
.142 
.170 
197 
.223 
.248 
.272 
295 
317 
.338 
.376 
411 
442 
471 
497 
.552 
.597 
.633 
664 
.690 
712 
732 
749 
777 
.800 
.818 


21111 
-080 
106 
133 
- 160 
. 186 
.212 
236 
- 260 
283 
.304 
.325 
363 
.398 
-430 
458 
.484 


111111 





THE MEAN SQUARE SUCCESSIVE DIFFERENCE 


By J. VON NEUMANN,¹ R. H. KENT, H. R. BELLINSON AND B. I. HART
Aberdeen Proving Ground 


1. Introduction. In making measurements, every precaution is generally 
taken to hold the conditions of the experiment constant, in order that the 
population, whose parameters are to be estimated from the observations, shall 
remain fixed throughout the experiment. One wishes each observation to come 
from the same population, or what is the same thing if normality is assumed, 
from populations having the same means and standard deviations. 

There are cases, however, where the standard deviation may be held constant, 
but the mean varies from one observation to the next. If no correction is made 
for such variation of the mean, and the standard deviation is computed from 
the data in the conventional way, then the estimated standard deviation will 
tend to be larger than the true population value. When the variation in the 
mean is gradual, so that a trend (which need not be linear) is shifting the mean 
of the population, a rather simple method of minimizing the effect of the trend 
on dispersion is to estimate standard deviation from differences. It is for this 
purpose that the mean square successive difference 


(1) $$\delta^2 = \frac{\sum_{i=1}^{n-1} (x_{i+1} - x_i)^2}{n - 1}$$

is suggested. The subscript $i$ in this expression refers to the temporal order of
the observation $x_i$.

In using $\delta^2$ for estimating standard deviation, the distribution of $\delta^2$ in random
samples is of interest, since questions of bias, efficiency, and confidence interval
require consideration. $\delta^2$ may be used, in addition, to determine whether a
trend actually exists; in this case one must know whether $\delta^2$ differs significantly
from

(2) $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n},$$

which measures variance independently of the order of the observations, and
consequently includes the effect of the trend.
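The contrast between the two statistics can be illustrated on a short series with a pure linear trend. The sketch below is illustrative only; the function names are hypothetical, and the data are made up to show that $\delta^2/2$ is much less affected by the trend than $s^2$.

```python
def mean_square_successive_difference(x):
    """delta^2: sum of squared successive differences divided by n - 1."""
    n = len(x)
    return sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1)) / (n - 1)

def variance_about_mean(x):
    """s^2: sum of squared deviations from the sample mean divided by n."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / n

if __name__ == "__main__":
    trend = [0.0, 1.0, 2.0, 3.0, 4.0]  # trend only, no random scatter
    print(mean_square_successive_difference(trend) / 2)  # 0.5
    print(variance_about_mean(trend))                    # 2.0
```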


1 Institute for Advanced Study, Princeton, N. J. Also member of Scientific Advisory 
Committee of the Ballistic Research Laboratory, Aberdeen Proving Ground. 
The distribution of $\delta^2$ is considered in this paper; it is hoped that others will
shortly publish methods of estimating the probability that $\delta^2 < ks^2$ as a function
of $k$ and the sample size $n$.





2. History. A somewhat similar procedure is suggested by "Student" [1]
and E. S. Pearson [2], who consider the situation in which a shift may occur in
the mean of the population, but where pairs of observations may be made with
no shift in mean between them; standard deviation may be estimated from the
differences between these pairs. The method can be generalized, and

$$s' = \sqrt{\frac{\sum_{i=1}^{n/2} (x_{2i} - x_{2i-1})^2}{n}}$$

is an estimate of the standard deviation. $n$ must, of course, be an even integer.
This estimate has the advantage that its properties are fully known: $s'$ is distributed
as the standard deviation with $f = n/2$ degrees of freedom. It will be
noted that this estimate does not involve the successive differences, but only
the alternate ones. Although there are $n - 1$ available successive differences,
this estimate uses only the $n/2$ independent differences. The mean square
successive difference is based on all $n - 1$ successive differences, and should
therefore provide a more efficient estimate of $\sigma$ than does $s'$.
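A small sketch of the paired-difference estimate may help. It assumes the reading $s' = \sqrt{\sum (x_{2i} - x_{2i-1})^2 / n}$ of the formula above, with observations taken in consecutive non-overlapping pairs; the function name is hypothetical.

```python
import math

def paired_difference_sd(x):
    """s': estimate of sigma from the n/2 non-overlapping pair differences.

    Each squared difference has expectation 2 sigma^2 when the mean does not
    shift within a pair, so the sum over n/2 pairs is divided by n.
    """
    n = len(x)
    if n % 2 != 0:
        raise ValueError("n must be an even integer")
    total = sum((x[i + 1] - x[i]) ** 2 for i in range(0, n, 2))
    return math.sqrt(total / n)

if __name__ == "__main__":
    print(paired_difference_sd([0.0, 1.0, 2.0, 3.0]))
```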

There is, of course, nothing new in the concept of estimating the standard 
deviation from differences. Even as far back as 1870, an interest in the method 
appears to have existed. Jordan [3] devised methods based on sums of powers 
of the differences. Helmert [4] gave more careful consideration to the case of 
the first power, i.e. the sum of the absolute differences. In both these cases, 
however, all the n(n — 1)/2 differences that can be established from a sample of 
n observations were included in the estimate, so that the estimate was of no 
value in reducing the effect of a trend. Helmert realized this, for he pointed 
out that the estimate obtained from the sum of squares of the differences is 
exactly that obtained by the more conventional procedure of squaring deviations 
from the mean. 

The usefulness of the differences between successive observations only appears 
to have been realized first by ballisticians, who faced the problem of minimizing 
effects due to wind variation, heat and wear in measuring the dispersion of the 
distance traveled by shell. Vallier [5] appears to have been the first to estimate 
dispersion from successive differences. Cranz and Becker [6] commended the 
mean successive difference

$$E_z = \frac{\sum_{i=1}^{n-1} |x_{i+1} - x_i|}{n - 1}.$$

To establish the precision of $E_z$ in estimating $\sigma$, Cranz and Becker quoted
Helmert's paper, and so erred in saying that their method was superior to that
of the mean deviation. Helmert's procedure, based on $n(n - 1)/2$ differences,
is indeed more precise (for $n > 10$) than the mean deviation

$$\text{M.D.} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n},$$

but the mean successive difference is based on but $n - 1$ differences, and so is
not as precise.
Bennett [7] appears to have suggested the use of successive differences independently
of the European ballisticians. In recent years, the method of estimation
by the mean square successive difference $\delta^2$ was put into practice in the
Ballistic Research Laboratory at the Aberdeen Proving Ground, U. S. Army,
by L. S. Dederick.


3. Bias and efficiency. The moments of $\delta^2$ in samples drawn from a normal
population are derived in Section 6 of this paper. The moments are used at
this point to establish the estimate of variance, and the efficiency of this estimate.

The mean value of $\delta^2$ in samples taken at random from a normal population is

(3) $$E(\delta^2) = 2\sigma^2.$$

$\delta^2/2$ consequently offers an unbiased estimate of variance, and this estimate is

(4) $$\frac{\delta^2}{2} = \frac{\sum_{i=1}^{n-1} (x_{i+1} - x_i)^2}{2(n - 1)}.$$

The second moment about the mean, i.e., the variance, of $\delta^2/2$ in samples of size $n$ is

(5) $$\sigma^2_{\delta^2/2} = \frac{3n - 4}{(n - 1)^2}\,\sigma^4.$$

As the sample size is increased, the distribution of $\delta^2$ appears to approach
the normal. It is therefore appropriate to consider the efficiency as defined by
Fisher [8]. Accordingly, the efficiency of $\delta^2$ is the ratio of the variance of the
conventional estimate of variance to the variance of $\delta^2/2$. Since the variance of the
conventional unbiased estimate of $\sigma^2$ is $2\sigma^4/(n - 1)$,
the efficiency of $\delta^2$ in estimating the standard deviation is

(6) $$\frac{2(n - 1)}{3n - 4} = \frac{2}{3} + \frac{2}{3(3n - 4)}.$$

The efficiency is unity for $n = 2$, since in this case the two statistics have
the same distribution. It therefore appears that the efficiency decreases as the
sample size increases, but approaches 2/3 as a limiting value for $n$ very large.
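As a quick numerical illustration of (6), the efficiency $2(n-1)/(3n-4)$ can be tabulated exactly with rational arithmetic. This is only a sketch; the function name is hypothetical.

```python
from fractions import Fraction

def efficiency(n):
    """Efficiency 2(n - 1) / (3n - 4) of the successive-difference estimate."""
    return Fraction(2 * (n - 1), 3 * n - 4)

if __name__ == "__main__":
    for n in (2, 3, 5, 10, 50, 1000):
        print(n, efficiency(n), float(efficiency(n)))
```

The values start at 1 for n = 2 and fall steadily toward 2/3.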











4. Summary of procedure. Having a statistic which estimates a parameter
of a population, it is desirable to know the distribution of that statistic as computed
from samples taken at random from that population. At present, the
distribution of $\delta^2$ in samples of $n$ has not been obtained. The difficulty is in the
fact that the successive differences are not independent. The first difference,
$d_1 = x_2 - x_1$, and the second difference, $d_2 = x_3 - x_2$, are related in that they
both involve $x_2$. Similar correlation exists between every successive pair of
differences between successive observations.

For $n = 2$, and samples taken from a normal population, the distribution of
$\delta^2$ is known. Since

$$\delta^2 = (x_2 - x_1)^2 = 2\sum_{i=1}^{2} (x_i - \bar{x})^2 = 4s^2,$$

the distribution of $\delta^2$ is similar to that of $s^2$ for this sample size.

For $n = 3$, the distribution of $\delta^2$ has been derived analytically. The derivation
is indicated in Section 5 of this paper. For $n > 3$, only the moments of
the distribution have thus far been obtained. A Pearson type distribution has
been fitted to the first three moments to obtain an approximate representation
of the true distribution.


5. Distribution of $\delta^2$. In the case of a sample of $n$ taken from a normal population,
the probability that the first observation lies between $x_1$ and $x_1 + dx_1$,
while the second lies between $x_2$ and $x_2 + dx_2$, etc., is

(7) $$\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-(x_1^2 + x_2^2 + \cdots + x_n^2)/2\sigma^2}\,dx_1\,dx_2 \cdots dx_n.$$

If $y_i = x_{i+1} - x_i$, this expression becomes

(8) $$\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-Q(x_1, y_1, \cdots, y_{n-1})/2\sigma^2}\,dx_1\,dy_1\,dy_2 \cdots dy_{n-1},$$

where $Q$ is a quadratic form in $x_1$ and the $y$'s. Since

$$\delta^2 = \frac{\sum_{i=1}^{n-1} y_i^2}{n - 1},$$

the probability that $\delta^2$ shall be less than some value $\delta_0^2$ is

(9) $$P(\delta^2 < \delta_0^2) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \int\!\!\int\!\cdots\!\int_{\sum y_i^2 < (n-1)\delta_0^2} e^{-Q/2\sigma^2}\,dx_1\,dy_1 \cdots dy_{n-1}.$$

After the integration with respect to $x_1$ is carried out, the quadratic form in
the exponent may be normalized by a transformation to new coordinates $z_i$
linearly related to the $y$'s. The $z$'s may be so chosen that all the terms $z_i^2$ in
the exponent have the same coefficient, in which case

(10) $$P(\delta^2 < \delta_0^2) = c \int\!\!\int\!\cdots\!\int e^{-\sum z_i^2/2\sigma^2}\, \frac{\partial(y_1, y_2, \cdots, y_{n-1})}{\partial(z_1, z_2, \cdots, z_{n-1})}\,dz_1\,dz_2 \cdots dz_{n-1}.$$

As a result of such a transformation, the sphere of integration in (9) becomes an
ellipsoid in (10). By changing to polar coordinates, with

$$r^2 = \sum_{i=1}^{n-1} z_i^2,$$

(11) $$P(\delta^2 < \delta_0^2) = c \int\!\!\int e^{-r^2/2\sigma^2}\, r^{n-2}\,d\Omega\,dr,$$

in which $\Omega$ is the solid angle in the space of $n - 1$ dimensions. The limits of
integration with respect to $\Omega$ as a function of $r$ must be found; this involves the
evaluation of the solid angle subtended by the surface bounded by the intersection
of the $(n - 1)$-dimensional sphere and the $(n - 1)$-dimensional ellipsoid.
If $\Omega = \phi(r)$,

(12) $$P(\delta^2 < \delta_0^2) = c \int_0^{a} e^{-r^2/2\sigma^2}\, \phi(r)\, r^{n-2}\,dr,$$

in which $a$ is the longest semi-axis of the $(n - 1)$-dimensional ellipsoid corresponding
to the given value of $\delta_0^2$.
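For $n = 3$, (9) becomes

(13) $$P(\delta^2 < \delta_0^2) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{3} \int\!\!\int_{y_1^2 + y_2^2 < 2\delta_0^2} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{3\sigma^2}(y_1^2 + y_2^2 + y_1 y_2) - \frac{3}{2\sigma^2}\left(x_1 + \frac{2y_1 + y_2}{3}\right)^{2}\right] dx_1\,dy_1\,dy_2$$

$$= \frac{1}{2\sqrt{3}\,\pi\sigma^2} \int\!\!\int_{y_1^2 + y_2^2 < 2\delta_0^2} e^{-(y_1^2 + y_2^2 + y_1 y_2)/3\sigma^2}\,dy_1\,dy_2.$$

Normalizing the quadratic form in the exponent,

(14) $$P(\delta^2 < \delta_0^2) = \frac{1}{2\sqrt{3}\,\pi\sigma^2} \int\!\!\int_{z_1^2 + z_2^2 < 2\delta_0^2} e^{-(z_1^2 + \frac13 z_2^2)/2\sigma^2}\,dz_1\,dz_2,$$

and in polar coordinates

(15) $$P(\delta^2 < \delta_0^2) = \frac{1}{2\sqrt{3}\,\pi\sigma^2} \int_0^{\delta_0\sqrt{2}} \int_0^{2\pi} e^{-(r^2/2\sigma^2)(1 - \frac23 \sin^2\theta)}\, r\,d\theta\,dr$$

$$= \frac{1}{2\sqrt{3}\,\pi\sigma^2} \int_0^{\delta_0\sqrt{2}} r e^{-r^2/2\sigma^2} \left[\int_0^{2\pi} e^{r^2 \sin^2\theta/3\sigma^2}\,d\theta\right] dr.$$

The integral in brackets can be shown to be a Bessel function of zero order;
for let

$$r^2/3\sigma^2 = -2iu,$$

then

(16) $$\int_0^{2\pi} e^{r^2 \sin^2\theta/3\sigma^2}\,d\theta = e^{-iu} \int_0^{2\pi} e^{iu\cos 2\theta}\,d\theta = 2\pi e^{-iu} J_0(u).$$

Consequently, (15) takes the form

(17) $$P(\delta^2 < \delta_0^2) = \frac{1}{\sigma^2\sqrt{3}} \int_0^{\delta_0\sqrt{2}} r e^{-r^2/3\sigma^2} J_0\!\left(\frac{ir^2}{6\sigma^2}\right) dr = F(\delta_0).$$

The probability density function is

(18) $$\frac{dF}{d(\delta^2)} = \frac{1}{\sigma^2\sqrt{3}}\, e^{-2\delta^2/3\sigma^2} J_0\!\left(\frac{i\delta^2}{3\sigma^2}\right) = \frac{1}{\sigma^2\sqrt{3}}\, e^{-2\delta^2/3\sigma^2} \left[1 + \frac{\delta^4}{2^2 3^2 \sigma^4} + \frac{\delta^8}{2^2 4^2 3^4 \sigma^8} + \cdots\right].$$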
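The density for $n = 3$ can be checked numerically. The sketch below assumes the reading $\frac{1}{\sigma^2\sqrt{3}} e^{-2\delta^2/3\sigma^2} J_0(i\delta^2/3\sigma^2)$ of (18), uses the identity $J_0(ix) = I_0(x)$ with the power series of $I_0$, and integrates with Simpson's rule; the function names, the integration range and the step count are choices of this sketch. If the reading is right, the density should integrate to 1 and have mean $2\sigma^2$, in agreement with (3).

```python
import math

def bessel_i0(x):
    """Modified Bessel function I_0(x) = J_0(ix), by its power series."""
    term = 1.0
    total = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * k)
        total += term
    return total

def density_n3(t, sigma=1.0):
    """Density of t = delta^2 for samples of n = 3 (assumed reading of (18))."""
    s2 = sigma ** 2
    return math.exp(-2.0 * t / (3.0 * s2)) * bessel_i0(t / (3.0 * s2)) / (s2 * math.sqrt(3.0))

def simpson(f, a, b, steps):
    """Composite Simpson rule with an even number of steps."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

if __name__ == "__main__":
    print(simpson(density_n3, 0.0, 80.0, 4000))                    # close to 1
    print(simpson(lambda t: t * density_n3(t), 0.0, 80.0, 4000))   # close to 2
```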
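6. Moments. The $t$-th moment of $\delta^2$ about the origin is defined by

(19) $$\mu'_t = E[(\delta^2)^t],$$

or

(20) $$(n - 1)^t \mu'_t = E\left(\left[\sum_{i=1}^{n-1} (x_{i+1} - x_i)^2\right]^{t}\right) = E\left(\left[2\sum_{i=1}^{n} x_i^2 - (x_1^2 + x_n^2) - 2\sum_{i=1}^{n-1} x_i x_{i+1}\right]^{t}\right).$$

For any value of $t$, the expansion can be performed, and similar terms collected
and enumerated. The values of $x$ can be considered as true errors, i.e.
as deviations from the true mean, without affecting the conclusions. If the
original population from which the samples have been drawn is normal, with
standard deviation $\sigma$, then:

(21) $$E(x^{2k+1}) = 0, \qquad E(x^{2k}) = \frac{(2k)!}{2^k\, k!}\,\sigma^{2k},$$

and since, in the null case where the mean of the population remains constant,
successive observations are independent, then

(22) $$E(x_i^a x_j^b) = E(x^{a+b}), \quad i = j; \qquad E(x_i^a x_j^b) = E(x^a)E(x^b), \quad i \ne j.$$

These relations are sufficient for the evaluation of $\mu'_t$. For example, in the
case of the second moment, $t = 2$:

(23) $$(n - 1)^2 \mu'_2 = E\left(\left[2\sum_{i=1}^{n} x_i^2 - (x_1^2 + x_n^2) - 2\sum_{i=1}^{n-1} x_i x_{i+1}\right]^{2}\right)$$

$$= E\Bigg(4\left(\sum_{i=1}^{n} x_i^2\right)^{2} + (x_1^2 + x_n^2)^2 + 4\left(\sum_{i=1}^{n-1} x_i x_{i+1}\right)^{2} - 4(x_1^2 + x_n^2)\sum_{i=1}^{n} x_i^2 - 8\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n-1} x_i x_{i+1} + 4(x_1^2 + x_n^2)\sum_{i=1}^{n-1} x_i x_{i+1}\Bigg)$$

$$= E\Bigg(4\left[\sum_{i=1}^{n} x_i^4 + \sum_{i \ne j} x_i^2 x_j^2\right] + [x_1^4 + 2x_1^2 x_n^2 + x_n^4] + 4\left[\sum_{i=1}^{n-1} x_i^2 x_{i+1}^2\right] - 4\left[x_1^4 + x_1^2 \sum_{i=2}^{n} x_i^2 + x_n^2 \sum_{i=1}^{n-1} x_i^2 + x_n^4\right] + [\text{terms containing odd powers of } x_i]\Bigg).$$

The mean of these terms is found by using (21) and (22), and the number of
each type of term present is enumerated:

$$4[n(3\sigma^4) + n(n - 1)\sigma^2\sigma^2] + [3\sigma^4 + 2\sigma^2\sigma^2 + 3\sigma^4] + 4[(n - 1)\sigma^2\sigma^2] - 4[3\sigma^4 + \sigma^2(n - 1)\sigma^2 + \sigma^2(n - 1)\sigma^2 + 3\sigma^4] = (4n^2 + 4n - 12)\sigma^4.$$

Consequently

(24) $$\mu'_2 = \frac{4(n^2 + n - 3)}{(n - 1)^2}\,\sigma^4.$$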
The first four moments about the origin were evaluated by this procedure, 
and from these, the moments about the mean are readily determined. The
results are:

(25)
$$\mu'_1 = 2\sigma^2, \qquad \mu'_2 = \frac{4(n^2 + n - 3)}{(n - 1)^2}\,\sigma^4,$$
$$\mu'_3 = \frac{8(n^3 + 6n^2 + 2n - 21)}{(n - 1)^3}\,\sigma^6, \qquad \mu'_4 = \frac{16(n^4 + 14n^3 + 53n^2 - 8n - 231)}{(n - 1)^4}\,\sigma^8;$$

$$\mu_1 = 0, \qquad \mu_2 = \frac{4(3n - 4)}{(n - 1)^2}\,\sigma^4,$$
$$\mu_3 = \frac{32(5n - 8)}{(n - 1)^3}\,\sigma^6, \qquad \mu_4 = \frac{48(9n^2 + 46n - 112)}{(n - 1)^4}\,\sigma^8.$$
It should be noted at this point that the above fourth moment is incorrect
for $n = 2$. One of the terms in the expansion of the right side of (20), for
$t = 4$, is

$$x_1^2 x_n^2 \sum_{i=1}^{n-1} x_i^2 x_{i+1}^2.$$

For $n = 2$, the mean value of this term is

$$E(x_1^2 x_2^2 x_1^2 x_2^2) = E(x_1^4)E(x_2^4) = 9\sigma^8,$$

whereas for $n > 2$, the mean value is

$$E(x_1^4 x_2^2 x_n^2) + E\left(x_1^2 x_n^2 \sum_{i=2}^{n-2} x_i^2 x_{i+1}^2\right) + E(x_1^2 x_{n-1}^2 x_n^4) = (n + 3)\sigma^8.$$

7. Pearson type fit to distribution of $\delta^2$. From the moments it is found that

(26) $$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{16(5n - 8)^2}{(3n - 4)^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{3(9n^2 + 46n - 112)}{(3n - 4)^2}.$$

As $n$ becomes large, $\beta_1$ and $\beta_2$ approach 0 and 3 respectively; the distribution
therefore appears to approach the normal for large samples. For finite sample
sizes, the values of $\beta_1$ and $\beta_2$ correspond to those of the Pearson Type VI
distribution,

$$f\!\left(\frac{\delta^2}{\sigma^2}\right) = c \left(\frac{\delta^2}{\sigma^2} + a_1\right)^{q_2} \left(\frac{\delta^2}{\sigma^2} + a_2\right)^{-q_1}.$$

The origin of this distribution is at $\delta^2 = -a_1\sigma^2$, but the origin of the true distribution
must be at $\delta^2 = 0$. By taking $a_1 = 0$ so that the origin is at $\delta^2 = 0$,
we obtain what appears to be a suitable approximation

(27) $$f\!\left(\frac{\delta^2}{\sigma^2}\right) = c \left(\frac{\delta^2}{\sigma^2}\right)^{q_2} \left(\frac{\delta^2}{\sigma^2} + a_2\right)^{-q_1}.$$

The parameters are determined by equating the 1st, 2nd and 3rd moments of
(27) to the corresponding moments of the true distribution, with the result that
(the moments $\mu'_1$ and $\mu'_2$ being taken with $\sigma = 1$)

(28)
$$q_2 = \frac{3n^4 - 10n^3 - 18n^2 + 79n - 60}{8n^3 - 50n + 48},$$
$$q_1 = q_2 + 2 + \frac{\mu'_2 (q_2 + 1)}{\mu'_2 (q_2 + 1) - \mu'^{\,2}_1 (q_2 + 2)},$$
$$a_2 = \frac{\mu'_1 (q_1 - q_2 - 2)}{q_2 + 1},$$
$$c = \frac{a_2^{\,q_1 - q_2 - 1}}{B(q_2 + 1,\; q_1 - q_2 - 1)}.$$

Values of these parameters for selected values of $n$ are given in Table I. The
sixth and seventh columns of this table give the values of $\beta_2$ for the distribution
(27) and for the true distribution, respectively.


TABLE I
(1)   (2)        (3)       (4)       (5)                (6)        (7)        (8)
n     q_1        q_2       a_2       c                  β_2 (27)   β_2 True   Ratio (6)/(7)
5     24.4391    0.6391    26.6000   5.8800 × 10^34     8.807      8.504      1.036
7     31.1286    1.3857    23.2571   4.9285 × 10^42     6.948      6.758      1.028
10    41.2830    2.5079    20.9667   9.4934 × 10^54     5.658      5.538      1.022
15    58.2113    4.3806    19.2659   4.0240 × 10^75     4.718      4.645      1.016
20    75.1210    6.2543    18.4851   1.8063 × 10^96     4.269      4.217      1.012
25    92.0189    8.1285    17.9417   8.1097 × 10^116    4.006      3.965      1.010
50    176.4443   17.5018   16.9651   1.3386 × 10^220    3.494      3.475      1.005
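The entries of Table I can be reproduced from the moments. The sketch below takes the printed polynomial ratio for $q_2$, and then obtains $q_1$ and $a_2$ by matching the first two moments of the curve $c\,x^{q_2}(x + a_2)^{-q_1}$ to $\mu'_1 = 2$ and $\mu'_2 = 4(n^2+n-3)/(n-1)^2$ (with $\sigma = 1$). That moment-matching step is a derivation made for this sketch, not necessarily the authors' printed form of the formulas; the function name is hypothetical.

```python
import math

def type_vi_parameters(n):
    """Return q1, q2, a2, log10(c) and the true beta_2 for sample size n.

    Assumption: the fitted curve is c * x^q2 * (x + a2)^(-q1) on x > 0, whose
    first two moments are matched to those of delta^2 / sigma^2.
    """
    q2 = (3 * n**4 - 10 * n**3 - 18 * n**2 + 79 * n - 60) / (8 * n**3 - 50 * n + 48)
    mu1 = 2.0
    mu2 = 4.0 * (n**2 + n - 3) / (n - 1) ** 2
    m = mu2 * (q2 + 1) / (mu2 * (q2 + 1) - mu1**2 * (q2 + 2))  # m = q1 - q2 - 2
    q1 = q2 + 2 + m
    a2 = mu1 * (q1 - q2 - 2) / (q2 + 1)
    log_beta = math.lgamma(q2 + 1) + math.lgamma(q1 - q2 - 1) - math.lgamma(q1)
    log10_c = ((q1 - q2 - 1) * math.log(a2) - log_beta) / math.log(10.0)
    beta2_true = 3.0 * (9 * n**2 + 46 * n - 112) / (3 * n - 4) ** 2
    return q1, q2, a2, log10_c, beta2_true

if __name__ == "__main__":
    for n in (5, 7, 10, 15, 20, 25, 50):
        print(n, type_vi_parameters(n))
```

For n = 5 this gives approximately q_1 = 24.4391, q_2 = 0.6391, a_2 = 26.6000 and a true β_2 of 8.504, in line with the first row of the table.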


The Tables of the Incomplete Beta-Function [9] can be used to evaluate the
probability integral of the distribution (27),

(29) $$P(\delta^2 < \delta_0^2) = \int_0^{\delta_0^2/\sigma^2} f\!\left(\frac{\delta^2}{\sigma^2}\right) d\!\left(\frac{\delta^2}{\sigma^2}\right) = 1 - I_x(q_1 - q_2 - 1,\; q_2 + 1), \qquad x = \frac{a_2}{a_2 + \delta_0^2/\sigma^2},$$

for $n \le 14$. For $n > 14$, the probability integral may be determined by quadrature.
Some values of the probability integral for $n = 50$ are given in Table II.
A comparison with the integral of the normal curve having the same first two
moments indicates that a sample of somewhat more than 50 is required before
the normal curve becomes a satisfactory approximation to the distribution (27).


TABLE II

P(δ² < δ_0²) for n = 50

δ_0²/σ²    (29)      Normal
 .50       .00000    .00118
 .75       .00031    .00563
1.00       .00647    .02129
1.25       .04393    .06418
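The "Normal" column of Table II can be reproduced from the first two moments. The sketch below assumes a normal curve with mean $2$ and variance $4(3n-4)/(n-1)^2$ in units of $\sigma^2$, and assumes that the tabulated arguments are spaced by 0.25 starting at 0.50; the function name is hypothetical.

```python
import math

def normal_approximation(ratio, n):
    """Normal-curve value of P(delta^2 / sigma^2 < ratio) for sample size n."""
    mean = 2.0
    variance = 4.0 * (3 * n - 4) / (n - 1) ** 2
    z = (ratio - mean) / math.sqrt(variance)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

if __name__ == "__main__":
    for ratio in (0.50, 0.75, 1.00, 1.25):
        print(ratio, round(normal_approximation(ratio, 50), 5))
```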
THE RETURN PERIOD OF FLOOD FLOWS
By E. J. GUMBEL
New School for Social Research 


Introduction. Engineers have used various interpolation formulas to repre- 
sent the observed distribution of flood discharges. These formulas are some- 
times constructed ad hoc for a given stream, and have no general meaning. Most 
of them are rather complicated.’ Some authors have tried to introduce upper 
and lower limits to the discharges, even though it is doubtful that such limits 
exist. Others have introduced the third and fourth moments of the distribution, 
in spite of the fact that these numerical values are subject to large errors. For 
some formulas it is impossible to give a meaning to the constants; different formulas
applied to the same stream give rather contradictory results; and consequently
there is considerable confusion. For example, Slade [20] has stated that
"the statistical method in whatever form employed is an entirely inadequate
tool in the determination of flood frequencies." According to Saville [19] "the
engineer should satisfy himself that he has used an adequate number of methods,
whether mathematical, graphic or otherwise, which have real support from either
theory or experience, and then form his own judgement."

The main reason for this situation is that these studies have little or no 
theoretical basis. The author believes it possible to give exact solutions, 
exactitude being interpreted from the standpoint of the calculus of probabilities 
[10]. Our solutions are simply the consequences of a truism: "The flood discharges
are the largest values of the discharges." The present study is but an
explanation of this statement. 

Many American authors start with a statistical function, which we call the 
return period of floods. Therefore we shall first analyse the notion of return 
period and show how it can be derived as a consequence of the concept of dis- 
tribution. We then give a short résumé of the theory of largest values. The 
discharge, and in consequence the flood discharge, is considered as an unlimited 
statistical variable; it is not necessary to determine its distribution. We are 
justified in representing the observed distribution of flows by one of the the- 
oretical distributions of largest values. The distribution we choose contains 
only two constants, and both have a clear hydrological meaning. The numeri- 
cal values are calculated by the method of moments. 


1 In recent years many articles discussing this topic have been published by the American 
Society of Civil Engineers and the American Geophysical Union [8]. A review of some of 
the proposed formulas is given in the Water Supply Paper 771 [17]. 
The application of the notion of return period to the largest values leads to a
simple formula for the return period of the floods. In the last part of this paper
we represent the flood flows of the Rhône and Mississippi Rivers by our formula.


1. The return period. Let us consider a continuous statistical variable $x$,
having a theoretical distribution $w(x)$. The probability $W(x)$ of a value less
than or equal to $x$, and the probability $P(x)$ of a value greater than or equal to
$x$, are

(1)   $W(x) = \int_{-\infty}^{x} w(z)\,dz, \qquad P(x) = \int_{x}^{\infty} w(z)\,dz,$

where $z$ denotes the variable of integration. Clearly

(1')  $W(x) + P(x) = 1.$
Let $n$ be the number of observations. Let $x_m$ ($m = 1, 2, \ldots, n$) be the
observed values arranged in increasing magnitude, where $m$ is the serial number
beginning with the lowest (‘from below’). The lowest observation has the
serial number $m = 1$, the highest has the serial number $m = n$. These observed
values will be written $x_1$ and $x_n$ respectively. The number of observations
below or equal to $x_m$ is $m = n\,'W(x_m)$, where $'W(x_m)$ is the observed relative
number corresponding to the probability $W(x)$. The graphic representation of
this series is called a cumulative histogram.

In hydraulics many authors arrange the observations in decreasing magnitude.
Let ${}_m x$ ($m = 1, 2, \ldots, n$) be these observed values. The serial number $m$ is
counted in a descending scale (‘from above’). For the largest value $m = 1$,
for the lowest value $m = n$. The number of observations above or equal to
${}_m x$ is $m = n\,'P({}_m x)$, where $'P({}_m x)$ corresponds to $P(x)$. The numbers $'W(x_m)$
will never decrease; the numbers $'P({}_m x)$ will never increase. The $m$th value on
a descending scale is the $(n - m + 1)$th value on an ascending scale. Therefore

(2)   $n\,'P(x_m) = n - n\,'W(x_m) + 1,$

and

(2')  $n P(x) = n - n W(x).$

The difference between formulas (2) and (2') will play a certain rôle later.

Different methods are used in statistics in comparing the theoretical values
$W(x)$ or $P(x)$ and $w(x)$ with the corresponding observations $'W(x_m)$ or $'P({}_m x)$
(cumulative frequencies) and $\Delta\,'W(x_m)$ (frequency distribution). They all have
in common an arrangement of observed values according to magnitude.

For the purpose of considering the observations in chronological order, we
introduce a statistical criterion which at first glance may appear to have a new
logical structure. It is assumed here that the observations are made at constant
time intervals, and this interval is considered the unit of time. We suppose
that the observations are homogeneous, i.e., subject to a common set of forces.
Furthermore, we suppose that the events are independent of one another: the
occurrence of a high or low value for $x$ has no influence on the value of any
succeeding observation. Let us choose a low value $x$ and ask the following:
After what number of observations does this or a greater value return? We
calculate the mean of the chronological intervals between every two consecutive
values equal to or greater than $x$. We repeat these operations for a second,
third, ... up to the penultimate value of $x$.

These means are called the observed return periods. The criterion consists of
the comparison of the observed and the theoretical return periods for increasing
values of $x$. For a discontinuous variable we could obtain the return period for
a value equal to $x$ (not equal to or greater than $x$). This average time, which
is sometimes used in physics, does not interest us, as our variable, the discharge,
is continuous. We limit our consideration to the return period of a value equal
to or greater than $x$, called for short a value greater than $x$.

The determination of the theoretical return period is a classical problem:
How many trials must, on the average, be made in order that an event of a
given probability should happen? Our event, the realization of a value equal
to or greater than $x$, has the probability $P(x) = 1 - W(x)$.

The mean number of trials $T(x)$ which are necessary to obtain our event once
is evidently

(3)   $T(x) = \dfrac{1}{1 - W(x)},$

or

(3')  $T(x) = \dfrac{1}{P(x)}.$

This value $T(x)$ is the mean chronological interval between two values equal
to or greater than $x$. If we start at the time when such a value has been observed
for the first time, we can interpret $T(x)$ as the theoretical return period
of a value equal to or greater than $x$. We designate it as the theoretical return
period. This concept has not been used in statistics. It is a well-known concept
in hydraulics which was introduced by Fuller [6]. To every theoretical
distribution $w(x)$ there is a corresponding return period $T(x)$ and conversely,
to every theoretical return period $T(x)$ there is a corresponding distribution

(4)   $w(x) = \dfrac{T'(x)}{T^2(x)},$

obtained by differentiating (3).

If the variable is without limit to the left, the return period will start with
$T = 1$. If the variable is limited to the left by $x = \epsilon$ the corresponding return
period will be

(5)   $T(\epsilon) \geq 1$ if $W(\epsilon) \geq 0.$

In the graphic representation, the return period $T(x)$, which has a time dimension,
will be the abscissa and $x$ the ordinate. Therefore we consider $x$ as a function
of $T(x)$; from (4) we obtain

(6)   $\dfrac{dx}{d \ln T} = \dfrac{1}{w(x)\,T(x)},$

where ln signifies the natural logarithm. The increase of $x$ as a function of
$\ln T(x)$ will be very rapid for small values of $T$. For a limited distribution
the same result is obtained, provided the probability $W(\epsilon)$ and the density of
probability $w(\epsilon)$ are sufficiently small. Clearly, the return periods of the three
quartiles are respectively $1\tfrac{1}{3}$, 2, 4. The return period will always increase
with $x$. It will tend towards infinity even if the variable is limited to the right.
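
As a small check of formula (3) and of the quartile values just quoted, the following sketch evaluates $T = 1/(1 - W)$ with exact fractions. The function name `return_period` is ours, introduced only for illustration.

```python
from fractions import Fraction


def return_period(W):
    """Theoretical return period T = 1 / (1 - W), formula (3)."""
    if not 0 <= W < 1:
        raise ValueError("W must lie in [0, 1)")
    return 1 / (1 - W)


# Return periods of the three quartiles, W = 1/4, 1/2, 3/4.
quartile_periods = [return_period(Fraction(k, 4)) for k in (1, 2, 3)]
print(quartile_periods)  # [Fraction(4, 3), Fraction(2, 1), Fraction(4, 1)]
```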

Let us now consider the calculation of the observed return periods. Instead of
values equal to or greater than $x_m$, we will only speak of values greater than $x_m$.
The observed return period is the interval between the first and the last observation
greater than $x_m$, divided by the number of intervals between all observations
greater than $x_m$. The number of observations greater than $x_m$ is
$n - n\,'W(x_m)$. Between these observations there are $n - n\,'W(x_m) - 1$ intervals.
This denominator is independent of the chronological order of the observed
values. We can calculate the mean of the observed intervals up to a value $x_m$
such that $n - n\,'W(x_m) = 2$. For this value of $x_m$ there are only two observations,
i.e., only one interval. In that case no mean can be calculated.

The numerator, the interval between the first and the last observation greater
than $x_m$, will be $n - 1$, provided that the first and the last value in chronological
order are greater than $x_m$. But in general the first value greater than $x_m$ will
be the $('k + 1)$th in chronological order. The first value greater than $x_m$ found
in the reverse chronological order will be the $(k' + 1)$th. Let $'k + k' = l$; then
the interval between the last and the first value greater than $x_m$ is $n - 1 - l$.
The mean observed interval is thus

      ${}_l T(x_m) = \dfrac{n - 1 - l}{n - 1 - n\,'W(x_m)},$

or

(7)   ${}_l T(x_m) = \left(1 - \dfrac{1 + l}{n}\right) \Big/ \left(1 - {}'W(x_m) - \dfrac{1}{n}\right).$


This magnitude depends only on the chronological order of the first and the
last value greater than $x_m$. It is independent of the chronological order of all
other observations. Even in the case $l = 0$ this value differs from the theoretical
value (3). The observed value surpasses the theoretical value, even if the
frequency $'W(x_m)$ is identical with the probability $W(x)$.
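
The chronological definition can be sketched as follows. The series and the threshold are invented for illustration, and the helper name `observed_mean_interval` is ours; the function returns the span between the first and last exceedance divided by the number of intervals, that is $(n - 1 - l)$ over the number of exceedances less one.

```python
def observed_mean_interval(series, threshold):
    """Mean chronological interval between successive values greater than
    `threshold`: (last position - first position) / (exceedances - 1)."""
    positions = [i for i, value in enumerate(series) if value > threshold]
    if len(positions) < 3:
        # With two exceedances there is a single interval and no mean.
        raise ValueError("need at least three exceedances to form a mean")
    return (positions[-1] - positions[0]) / (len(positions) - 1)


# Hypothetical record of n = 10 observations in chronological order.
series = [5, 1, 7, 2, 2, 8, 1, 9, 3, 1]
threshold = 4

# Exceedances occur at positions 0, 2, 5, 7, so 'k = 0, k' = 2, l = 2.
chronological = observed_mean_interval(series, threshold)  # (10 - 1 - 2) / 3

# Order-free value n / (number of exceedances), for comparison.
exceedances = sum(value > threshold for value in series)
order_free = len(series) / exceedances
print(chronological, order_free)
```

Only the positions of the first and the last exceedance affect the first number; rearranging the values in between leaves it unchanged.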

In the general case, $l > 0$, this difference is a function of $l$. The number $l$
depends upon the times at which the observations begin and cease; but it is
not a characteristic of the chronological order. As a result of these disadvantages
of formula (7) we prefer to introduce other definitions, in which the
chronological order does not enter. These definitions have an added advantage
in that they are constructed in a manner analogous to the theoretical formula.
The observed value which corresponds to (3) is

(8)   $'T(x_m) = \dfrac{n}{n - n\,'W(x_m)},$

or

(9)   $'T(x_m) = n/(n - m).$

But this definition of the observed return period is not the only one which
corresponds to (3). Starting with the serial number $m$ in a descending scale,
Fuller [6] puts

(8')  $''T({}_m x) = \dfrac{n}{m}.$

According to this definition, the return period of the $m$th value from below is

(9')  $''T(x_m) = n/(n - m + 1).$
TABLE I
Two definitions of the observed return periods

observed    serial number   serial number   exceedance interval   recurrence interval
variable    from below      from above      formula (9)           formula (9')

$x_1$       1               n               n/(n - 1)             1
$x_2$       2               n - 1           n/(n - 2)             n/(n - 1)
$x_m$       m               n - m + 1       n/(n - m)             n/(n - m + 1)
$x_{n-1}$   n - 1           2               n/1                   n/2
$x_n$       n               1               -                     n/1


This observed return period corresponds to the theoretical return period (3').
The difference between (9) and (9') results from the fact that the relation (2)
between the observed cumulative frequencies $'W(x_m)$ and $'P(x_m)$ differs from the
relation (2') between the probabilities $W(x)$ and $P(x)$. The two definitions
of the observed return periods are related by

(10)  $''T(x_{m+1}) = {}'T(x_m) < {}'T(x_{m+1}).$

From a purely logical standpoint the first definition is as justifiable as the
second one. Both are used in hydraulics. In order to avoid confusion between
formulas (9) and (9') Horton [16] calls $'T(x_m)$ the exceedance interval, i.e., “the
average interval at which an event of given magnitude is exceeded,” whereas
he defines $''T(x_m)$, the recurrence interval, as “the average interval of occurrence
of values equalling or exceeding a given magnitude.” Of course, the exceedance
interval surpasses the recurrence interval. Since both observed intervals correspond
to a common theoretical return period we designate both of them as
observed return periods.
The difference between formulas (9) and (9') is made clear in Table I.

Each of the definitions (9) and (9') and the theoretical expression $T(x)$ has
different properties. For the lowest observation

      $n\,'W(x_1) = 1; \qquad n\,'P(x_1) = n.$

Therefore

      $'T(x_1) = 1 + \dfrac{1}{n - 1}; \qquad ''T(x_1) = 1,$

whereas for an unlimited distribution $\lim_{x \to -\infty} T(x) = 1$.


If the number of observations is sufficiently large the numerical differences
between the two observed periods are rather small, except for very large values
of the variable. For the last observation

      $n\,'W(x_n) = n; \qquad n\,'P(x_n) = 1.$

Therefore the return period $'T(x_n)$ for the last observation does not exist. According
to the second definition the return period for the last value is equal to
the total number of observations. But in general there is only one observation
of the last value.
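
The two observed definitions, their relation (10), and their behaviour at the lowest and the highest observation can be sketched as below. The function names are ours, and $n = 111$ is used only as an illustrative number of observations.

```python
def exceedance_interval(n, m):
    """Formula (9): 'T(x_m) = n / (n - m), m counted from below.
    It does not exist for the last observation, m = n."""
    if not 1 <= m < n:
        raise ValueError("exceedance interval is defined for 1 <= m < n")
    return n / (n - m)


def recurrence_interval(n, m):
    """Formula (9'): ''T(x_m) = n / (n - m + 1), m counted from below."""
    if not 1 <= m <= n:
        raise ValueError("serial number must satisfy 1 <= m <= n")
    return n / (n - m + 1)


n = 111  # illustrative number of observations

# Relation (10): ''T(x_{m+1}) = 'T(x_m) for every m < n.
relation_10_holds = all(
    recurrence_interval(n, m + 1) == exceedance_interval(n, m)
    for m in range(1, n)
)

lowest = (exceedance_interval(n, 1), recurrence_interval(n, 1))  # (1 + 1/(n-1), 1)
highest_recurrence = recurrence_interval(n, n)                   # equals n
print(relation_10_holds, lowest, highest_recurrence)
```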

The preference given formula (9) over (9’) corresponds with the preference 
given to W(x) over P(x) when comparing the theoretical with the observed 
values. Therefore it is natural to count m from below. Since both definitions 
are equally applicable and since they lead to different results for large values of 
the variable, one should not calculate the return period for a small number of 
observations. 

The observed return periods (9) and (9') differ from the theoretical return
period (3) in the same way that the frequencies $'W(x_m)$ or $'P({}_m x)$ differ from the
probabilities $W(x)$ or $P(x)$. The chronological order enters neither into formula
(7) nor into (9) or (9’). We need not take it into consideration, since the 
theoretical return period is obtained from the probability and the observed 
return period from the cumulative histogram. Therefore the usual statistical 
methods can be used for making the comparison between observed and theoreti- 
cal return periods. 

The return period is a statistical function like the distribution $w(x)$ or the
probability $W(x)$. No formula for $T(x)$ that contradicts the properties of $w(x)$
can be accepted. The return period $T(x)$ will contain the same number of independent
constants as the distribution $w(x)$. Consequently the fit of the theoretical
curve $T(x)$ to the observations $'T(x_m)$ or $''T(x_m)$ cannot be improved by
introducing a new constant without also changing the distribution $w(x)$. The
theoretical curve $x = f(T)$ will fit the observed curves $(x_m, 'T(x_m))$ and
$(x_m, ''T(x_m))$ in a way that depends upon the fit of $W(x)$ and $P(x)$ to $'W(x_m)$
and $'P({}_m x)$.

Let us suppose that $w(x)$ contains $k$ constants, and that they are determined by the
method of moments, which conserves the arithmetic mean $\bar{x}$, the mean of the
squares $\overline{x^2}$, etc. of the observed distribution. For the return period these
moments have a meaning. Let us consider for the sake of simplicity a positive
variable. The $k$th moment $M_k$,

      $M_k = \int_0^\infty x^k\,dW(x) = -\int_0^\infty x^k\,d(1 - W(x)) = k \int_0^\infty (1 - W(x))\,x^{k-1}\,dx,$

is according to (3)

(11)  $M_k = k \int_0^\infty \dfrac{x^{k-1}}{T(x)}\,dx,$

whence for $k = 1$ and $k = 2$

(11') $\bar{x} = \int_0^\infty \dfrac{dx}{T(x)}; \qquad \overline{x^2} = 2 \int_0^\infty \dfrac{x\,dx}{T(x)}.$

For a given distribution containing two constants, the method of moments conserves
the area and the center of gravity of the reciprocal of the return period.
Even if the method of moments gives the best determination of the constants
for the distribution, it need not give the best determination for the return
period. But if the observed return periods were used for the determination of
the constants we would get two sets, since there are two observed curves having
equal validity, but different values for large $x$. We will get one and only one
set if the constants are calculated from the observed distribution, for here the
difference between $'T(x_m)$ and $''T(x_m)$ does not matter. The fact that we do
not take the constants from the observed return periods, but from another
statistical function, might be a cause for deviations between the observed and
the theoretical return periods.
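
Formula (11') can be checked numerically on a case where the moments are known. The sketch below assumes, purely for illustration, the positive variable with $W(x) = 1 - e^{-x}$, for which $1/T(x) = e^{-x}$, the mean is 1 and the mean square is 2. The trapezoidal helper `integrate` and the finite upper limit 50 are our own choices.

```python
import math


def integrate(f, a, b, steps=200_000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


def inverse_return_period(x):
    """1 / T(x) = 1 - W(x) = exp(-x) for the assumed exponential variable."""
    return math.exp(-x)


# (11') with the infinite upper limit truncated at 50, where exp(-x) is negligible.
mean = integrate(inverse_return_period, 0.0, 50.0)
mean_square = 2 * integrate(lambda x: x * inverse_return_period(x), 0.0, 50.0)
print(mean, mean_square)  # close to 1 and 2
```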

Once the constants have been found, we compare the observed curves
$(x_m, 'T(x_m))$ and $(x_m, ''T(x_m))$ with the theoretical curve $x = f(T)$. To avoid
discontinuity the observed return period will be established for all values of $x_m$,
arranged in increasing order.

If the observed return periods for small values of $x$ are systematically smaller
(greater) than the theoretical period, it is reasonable to conclude that there
exists an attraction (repulsion) for small values of the variable and a repulsion
(attraction) for the large values. But it must be remembered that the observed
values have different weights in that the return periods for small values of $x$ are
based on many observations. This number diminishes as $x$ increases. The last
observed return period is based only on two observations. Therefore the divergence
between theory and observation will increase with the variable. With
this precaution the criterion of the return period suggests one cause of difference
between theory and observation. In order to apply this method to the largest
values we must first establish the corresponding distribution.






























2. Theory of the largest value. Let $x$ be a statistical variable unlimited to
the right having the distribution $w(x)$. Among the $N$ observed values, one will
be larger than the others. We wish to determine its theoretical value.

According to the principle of multiplication the probability $W_N(x)$ that $N$
values are inferior to $x$ is

(12)  $W_N(x) = W^N(x).$

This is the probability of $x$ being the largest value. The largest value is a new
statistical variable which possesses a mode, a mean $\bar{x}$, a standard deviation $s$
and higher moments. To get the mean the distribution $w_N(x)$ of the largest
value is needed. From (12) by differentiation

(13)  $w_N(x) = N\,W^{N-1}(x)\,w(x).$

The mode will be the solution of

(13') $(N - 1)\,\dfrac{w(x)}{W(x)} + \dfrac{w'(x)}{w(x)} = 0.$

For a given initial distribution $w(x)$ and for small $N$ we have to solve this equation.
But the mean and the moments cannot be obtained in a general way by
the use of the exact distribution (13). However we can reach general solutions
if $N$ is large, provided we limit ourselves to certain classes of initial distributions.
We have studied this problem in previous publications [11-13]. For our present
purpose it is sufficient to give the results in a form due to R. von Mises [18].

We define a large value $u$ of the variable $x$ by

(14)  $N(1 - W(u)) = 1.$

This means that the expected number of observations equal to or greater than $u$
is one. Equation (14) is but another form of definition (3). The mean number
of trials is used in (3) whereas the original variable $x$ is used in (14).

The probability $\alpha\,du$ that a value greater than $u$ will be contained between $u$
and $u + du$ is given by

(15)  $\alpha = \dfrac{w(u)}{1 - W(u)}.$

Obviously $\alpha$ and $u$ are functions of $N$ and the constants in the initial
distribution $w(x)$. There are two limiting forms of the probability (12),

      $\lim_{N \to \infty} W^N(x) = F(x); \qquad \lim_{N \to \infty} W^N(x) = \mathfrak{W}(x).$

If

(16)  $\lim_{u \to \infty} \alpha u = k > 0,$

we obtain

(17)  $F(x) = e^{-(u/x)^k}.$

This probability function was first established by Fréchet [5]. If

(18)

we obtain

(19)  $\mathfrak{W}(x) = e^{-e^{-\alpha(x - u)}}.$

This probability function is due to R. A. Fisher [4]. Let us consider the first
limit. The initial distributions which lead to it belong to the Pareto type.
For this distribution

      $w(x) = \dfrac{k}{x^{k+1}}; \qquad W(x) = 1 - \dfrac{1}{x^k}; \qquad x \geq 1,$

and condition (16) holds; for any value of $x$

      $\dfrac{x\,w(x)}{1 - W(x)} = k.$

The distribution $f(x)$ of the largest value, which corresponds to (17), is

(20)  $f(x) = \dfrac{k}{u}\left(\dfrac{u}{x}\right)^{k+1} e^{-(u/x)^k}.$

The mode $\tilde{x}_N$ of the largest value is the solution of

      $\dfrac{d}{dx}\left[(k + 1)\ln\dfrac{u}{x} - \left(\dfrac{u}{x}\right)^k\right] = 0,$ i.e. $\dfrac{k + 1}{x} = \dfrac{k\,u^k}{x^{k+1}},$

or

(21)  $\tilde{x}_N = u\left(\dfrac{k}{k + 1}\right)^{1/k}.$

According to the definition (14) the mode of the largest value will increase
with $N$. For a finite number of observations, which is always the case, the
mode will be limited. But the moments of order $k$ or higher will not exist.
For $k < 1$, no moment will exist. For $k < 2$, only the first moment, the mean,
exists, and so on.
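
The mode (21) can be checked against the density (20) directly. In this sketch the values $u = 100$ and $k = 2.5$ are arbitrary illustrative choices, and the function names are ours.

```python
import math


def frechet_density(x, u, k):
    """Density (20): f(x) = (k/u) (u/x)^(k+1) exp(-(u/x)^k), for x > 0."""
    ratio = u / x
    return (k / u) * ratio ** (k + 1) * math.exp(-(ratio ** k))


def frechet_mode(u, k):
    """Mode (21): u * (k / (k + 1))^(1/k)."""
    return u * (k / (k + 1)) ** (1 / k)


u, k = 100.0, 2.5  # illustrative constants, not taken from any stream
mode = frechet_mode(u, k)

# Locate the maximum of the density on a fine grid and compare with (21).
grid = [50.0 + 0.01 * i for i in range(10_001)]  # 50.00 ... 150.00
grid_mode = max(grid, key=lambda x: frechet_density(x, u, k))
print(mode, grid_mode)
```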

Let us consider now the second limit (19). The initial distributions which
lead to it belong to the exponential type. For this distribution [14]

      $w(x) = e^{-x}; \qquad W(x) = 1 - e^{-x}; \qquad x \geq 0,$


za) -9 


and for any value of $x$,
which means that condition (18) is fulfilled. Most of the distributions used in
statistics belong to this type. According to (19) the distribution of the largest
value is

(22)  $\mathfrak{w}(x) = \alpha\,e^{-\alpha(x - u) - e^{-\alpha(x - u)}}.$

If we introduce a reduced variable $y$ without dimension by the linear transformation

(23)  $y = \alpha(x - u),$

we get the reduced probability $\mathfrak{W}(y)$,

(24)  $\mathfrak{W}(y) = \mathfrak{W}(x) = e^{-e^{-y}}.$

The numerical values of this function, calculated by means of Becker's tables [1],
are given in Table II, cols. 1 and 2. The reduced distribution

(25)  $\mathfrak{w}(y) = e^{-y - e^{-y}}$

makes clear the meaning of $u$: the distribution has one and only one maximum,
which occurs for the reduced value $y = 0$. Therefore $u$ is the mode of the
largest value for a given set of $N$ observations. For an initial distribution $w(x)$
satisfying (18), and for large $N$, definition (3) of the return period as a function
of $x$ becomes identical with relation (14), which involves the number of observations
$N$ and the corresponding most probable value $u$.
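
How closely the exact probability (12) approaches the limiting form (19) can be illustrated for the exponential initial distribution $W(x) = 1 - e^{-x}$, for which (14) gives $u = \ln N$ and (15) gives $\alpha = 1$. The choice $N = 365$ and the range of $y$ examined are ours; the sketch is a numerical illustration, not a proof.

```python
import math


def exact_largest(x, N):
    """Formula (12), W^N(x), for the initial distribution W(x) = 1 - exp(-x)."""
    return (1.0 - math.exp(-x)) ** N


def limiting_largest(x, u, alpha=1.0):
    """Formula (19): exp(-exp(-alpha (x - u)))."""
    return math.exp(-math.exp(-alpha * (x - u)))


N = 365
u = math.log(N)  # from (14): N * exp(-u) = 1

# Largest absolute gap between (12) and (19) for reduced values y in [-2, 6].
reduced_values = [-2.0 + 0.1 * i for i in range(81)]
max_gap = max(
    abs(exact_largest(u + y, N) - limiting_largest(u + y, u))
    for y in reduced_values
)
print(max_gap)
```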

We wish to decide which distribution of the largest value is to be used to 
represent the given observations. This decision depends, according to (16) and 
(18), on the nature of the initial distribution at the extreme values of the 
variable. If the law of the observed initial variable is known, a precise answer 
can be given. But generally speaking, a distribution chosen to represent given 
observations is nothing but an interpolation formula. Formulas having different 
analytical properties may all give satisfactory results. One might fulfill condi- 
tion (16), and another (18). The conditions apply to the differential coefficient, 
whereas the initial observations are always discontinuous. Therefore they will 
not enable us to decide which, if any, of the conditions is met. For extreme
values of the variable $x$ the observed differences are large and nonuniform, and
there is therefore no way to replace the differentiation by a finite difference.
Consequently we have to use the observations of the largest values to control
the two competing theories and not the conditions. The fact that distribution
(20) has higher moments only under certain conditions is a strong practical
argument in favor of distribution (22). Therefore the following development
will be based on this distribution.

It can be shown that the mean error $\theta$ of distribution (22) is related to the
constant $\alpha$ by

(26)  $\theta = 0.98/\alpha.$

Therefore the constant $u$ is the most probable largest value for $N$ observations
and $1/\alpha$ a multiple of the mean error.

TABLE II
Probabilities and return periods of largest values

Columns: (1) reduced variable $y$; (2) probability $\mathfrak{W}(y)$; (3) return period, $\log T(x)$;
(4) flood discharge per second, Rhône R., in cubic meters; (5) flood discharge per
second, Mississippi R., in 1000 cubic feet.

[The numerical entries of this table are not legible in the source; only the value 0.99752 survives.]

TABLE III
Observed return periods
Rhône, Lyon (France) (1826-1936)

Flood       Serial    Return period       Flood       Serial    Return period
discharge   number                        discharge   number
$x_m$       $m$       $\log 'T(x_m)$      $x_m$       $m$       $\log 'T(x_m)$

899 
1172 
1231 
1272 
1272 
1432 
1432 





2475 313, 
2475 .321 
2475 .329 
2491 .338 
2514 .346 
2514 355 
2514 .364 
1439 2514 | .373 
1444 | ‘ 2538 .382 
1502 ; | 2554 .392 
1541 , | 2586 | 
1560 | 2594 
1639 | | 2594 
1706 | 4 | 2594 
1780 ' | 2602 
1829 5 | 2626 
1850 | | 2627 
1857 | | 2643 
1913 | 2675 
1913 . | 2675 
1934 | ; 2773 
1955 | j 2773 
1992 | 2773 
1992 2839 
2006 | ; | 2856 
2006 | , | 2881 
2013 | | 2881 
2050 . | 2965 
2050 | ; | 3007 
2072 | 3050 
2094 | ; 3058 
2101 | 3067 
2115 | | 3067 
2145 | | 3126 
2145 | 3179 
3214

TABLE III (concluded)

Flood       Serial    Return period       Flood       Serial    Return period
discharge   number                        discharge   number
$x_m$       $m$       $\log 'T(x_m)$      $x_m$       $m$       $\log 'T(x_m)$
2160 . 93 . 
2168 ° 94 .825 
2175 : 95 
2206 ° 96 
97 
98 
99 
100 
101 
102 
103 


105 


SES 3) 


— 
nS 
vo 


106 
107 
108 
109 
110 
111 


~I no 
ESRESS 


2452 
2467 


$\sum x_m = 276,773$; $\sum x_m^2 = 744,538,565$.

The arithmetic mean $\bar{x}$ of distribution (22) is [4]

(27)  $\bar{x} = u + \dfrac{c}{\alpha},$

where $c = 0.5772157$ is Euler's constant. The standard deviation $s$ is

(28)  $s = \dfrac{\pi}{\alpha\sqrt{6}}.$

Therefore

(29)  $\bar{x} = u + 0.45005\,s.$

The reduced variable $y$ introduced by (23) is related to the reduced variable
$z = (x - \bar{x})/s$ by

(30)  $z = \dfrac{\sqrt{6}}{\pi}\,y - \dfrac{c\sqrt{6}}{\pi}.$

The substitution of the numerical values leads to

(30') $z = 0.77970\,y - 0.45005.$

Conversely,

(31)  $y = 1.28255\,z + 0.57722.$
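
The numerical coefficients in (29) to (31) follow from $\sqrt{6}/\pi$ and Euler's constant, as the following sketch recomputes. The helper names `y_to_z` and `z_to_y` are ours.

```python
import math

EULER_C = 0.5772157  # Euler's constant as quoted in the text

slope = math.sqrt(6) / math.pi  # coefficient of y in (30'), about 0.77970
offset = EULER_C * slope        # constant in (29) and (30'), about 0.45005
inverse_slope = 1 / slope       # coefficient of z in (31), about 1.28255


def y_to_z(y):
    """Formula (30): z = (sqrt(6)/pi) * (y - c)."""
    return slope * y - offset


def z_to_y(z):
    """Formula (31): y = (pi/sqrt(6)) * z + c."""
    return inverse_slope * z + EULER_C


print(round(slope, 5), round(offset, 5), round(inverse_slope, 5))
```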


The value

(32)  $v = s/\bar{x},$

the coefficient of variation, is related to the product $\alpha u$. By (27)
$\alpha u = \alpha\bar{x} - c$ and by (28)

(33)  $\alpha u = \dfrac{\pi}{\sqrt{6}\,v} - c.$

Therefore the numerical value of $\alpha u$ can also be considered as a characteristic
of an observed distribution of largest values.

For the two constants we calculate for the observed distribution of largest
values the two first moments

(34)  $\bar{x} = \dfrac{1}{n}\sum_{m=1}^{n} x_m$

and

(35)  $\overline{x^2} = \dfrac{1}{n}\sum_{m=1}^{n} x_m^2.$

To get the observed standard deviation we use the Gaussian formula

(36)  $s^2 = \left(1 + \dfrac{1}{n - 1}\right)\left(\overline{x^2} - \bar{x}^2\right).$

According to (28) and (27)

(37)  $\dfrac{1}{\alpha} = 0.7796968\,s,$

and

(38)  $u = \bar{x} - \dfrac{0.5772157}{\alpha}.$

These formulas give the two constants in the distribution of largest values.
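
The method of moments described above can be sketched in a few lines. The inputs are the two sums printed under Table III for the Rhône; the count $n = 111$ is our assumption, taken from the period 1826-1936 and the highest serial number in that table, and the standard deviation uses the factor $1 + 1/(n - 1)$ as we read formula (36). The function name is ours.

```python
import math

EULER_C = 0.5772157


def fit_largest_value(n, sum_x, sum_x2):
    """Return (mean, s, 1/alpha, u) from the first two observed moments."""
    mean = sum_x / n
    mean_square = sum_x2 / n
    # Standard deviation with the correction factor 1 + 1/(n - 1), as in (36).
    s = math.sqrt((1 + 1 / (n - 1)) * (mean_square - mean ** 2))
    inv_alpha = (math.sqrt(6) / math.pi) * s  # formula (37)
    u = mean - EULER_C * inv_alpha            # formula (38)
    return mean, s, inv_alpha, u


# Sums quoted under Table III; n = 111 is an assumption (years 1826-1936).
mean, s, inv_alpha, u = fit_largest_value(111, 276_773, 744_538_565)
print(mean, s, inv_alpha, u)
```

The resulting $u$ and $1/\alpha$ are in cubic meters per second, like the discharges themselves.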















3. Flood flows interpreted as largest values. We will now apply the theory
of largest values to flood flows. Let us consider the daily flow as a statistical
variable, unlimited to the right. This idea is not new. The formulas proposed
by Fuller [7], Hazen [15], and numerous other authors all incorporate this
assumption. Gibrat [9] supposes that the daily flows vary according to Galton's
distribution. Instead of postulating a specific formula for the distribution of
flows we shall only suppose that it belongs to the usual exponential type, which
means that condition (18) is fulfilled.

We define a flood as being the largest value of the $N = 365$ daily flows. The
flood flows are therefore the largest values of flows. This commonplace implies
the distinction between floods and inundations. For each year there exists one
or more floods of the same magnitude, but there might exist several different
inundations or none at all. If there are several inundations in a year the
greatest one will be a flood; but a flood need not be an inundation: even a
dry year has a flood. We limit ourselves to floods, assume that $N = 365$ is a
large number, and represent the distribution of annual floods by the distribution
(22) of largest values.

There have been objections to the concept that the daily flow is an unlimited 
variable. Horton [16] believes that this implies the absurd idea of unlimited 
floods. This opinion is shared by Slade [20], who claims that there is a definite 
upper limit to the magnitude of the floods for a given stream. The theory of 
largest values confirms only partially Horton’s opinion. If we should choose 
distribution (20), the most probable annual flood will be limited. For this 
distribution, however, it might happen that the mean annual flood has no 
meaning. To avoid this we have chosen distribution (22), for which the mean 
annual flood and all the moments will be finite. A further justification of the 
use of (22) might be derived from the fact that Galton’s distribution belongs to 
the exponential type. As a final argument, numerical calculations show that 
formula (22) gives a better fit to the observed distributions of flows. 

The variable $x$ is the annual flood flow measured in cubic meters or cubic
feet per second. The mean $\bar{x}$ is the annual mean flood, whereas $u$ is the most
probable annual flood. The value $s$ is the standard deviation of the distribution
of annual floods. Finally $y$ is called the reduced flood.

The distribution (22) possesses the properties of the observed distribution of 
flood flows. It is asymmetrical; rising rather quickly but falling rather slowly. 
The modal value is to the left of the mean (see Fig. 3). 

To apply the theory of return periods let us consider the event of the highest
annual discharge being greater than $x$. We have to replace in formula (3) the
general probability $W(x)$ by the probability of flood discharges (19). The
number of observations $n$ is the number of years for which observations exist.

To use formula (3) we have to suppose that the intervals between the successive
floods are all equal to one year. This assumption conforms more or less
to the seasonal nature of floods.

The return period of a flood greater than $x$,

(39)  $T(x) = \dfrac{1}{1 - e^{-e^{-y}}},$

is the arithmetic mean of the intervals between two years which have a flood
discharge greater than $x$; the discharges for the intervening years are all less
than $x$. Therefore $T(x)$ is the mean of the number of years for which $x$ will be
surpassed once. Formula (39) gives the meaning of $u$ from the standpoint of
the return period. For $y = 0$

      $T(u) = \dfrac{e}{e - 1}.$

The return period $T(u)$ of the most probable annual flood is 1.58198 years. In
other words, the constant $u$ is the flood discharge with return period

(40)  $\log T(u) = 0.19920,$

where log signifies the common logarithm. The return period of the mean
annual flood is by (27) and (39) equal to 2.32762 years.
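
The return periods just quoted can be recomputed from (39). The sketch below uses `math.expm1` so that the denominator stays accurate for large $y$; the function name is ours.

```python
import math

EULER_C = 0.5772157


def return_period_reduced(y):
    """Formula (39): T = 1 / (1 - exp(-exp(-y))), y being the reduced flood."""
    return -1.0 / math.expm1(-math.exp(-y))


T_mode = return_period_reduced(0.0)      # most probable annual flood, y = 0
T_mean = return_period_reduced(EULER_C)  # mean annual flood, y = c by (27)
log_T_mode = math.log10(T_mode)          # common logarithm, as in (40)
print(T_mode, log_T_mode, T_mean)
```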

Let us now consider the relation between the flood discharge $x$ and its return
period for small and large values of $x$. To small values of $x$ correspond large
negative values of $y$ and therefore return periods $T$ approximating 1. The
distribution (25) of the largest values being unlimited, the flood discharge considered
as a function of $\log T$ will by (6) increase rapidly at first. To large
values of $x$ correspond large values of $y$ and $T(x)$. If we introduce the natural
logarithm, (39) gives

      $-\ln\left(1 - \dfrac{1}{T(x)}\right) = e^{-y}.$

For large values of $x$, viz., $T(x) \geq 10$, it is sufficiently accurate to use

      $\dfrac{1}{T(x)} = e^{-y},$

so that

(41)  $y = \ln T(x).$

If the common logarithm is used,

(42)  $\log T(x) = 0.434294\,\alpha(x - u).$

The logarithm of the mean number of years for which the flood discharge will
once be exceeded converges towards a linear function of $x$. This property of
the distribution of largest values was established by M. Coutagne [2]. Let us
write

(43)  $x = u + \dfrac{2.30258}{\alpha} \log T(x).$

Then $1/\alpha$ can be considered as a measure of the increase of a flood discharge
with respect to the logarithm of time.

According to the general formulas (6) and (42) the shape of the return period
as a function of the flood discharge $x$ is as follows: at the beginning, i.e., for small
flood discharges, the return periods are close to 1 and increase very slowly. At
the end, i.e., for large flood discharges, the logarithm of the return period converges
to a linear function of $x$.
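
The quality of the asymptotic relation (41) can be judged by comparing it with the exact inversion of (39). The function names and the sample return periods below are ours.

```python
import math


def reduced_flood_exact(T):
    """Invert (39): y = -ln(-ln(1 - 1/T)), for T > 1."""
    return -math.log(-math.log1p(-1.0 / T))


def reduced_flood_asymptotic(T):
    """Formula (41): y = ln T."""
    return math.log(T)


# Gap between the asymptotic and the exact reduced flood for a few periods.
gaps = {
    T: reduced_flood_asymptotic(T) - reduced_flood_exact(T)
    for T in (2, 10, 100, 1000)
}
for T, gap in gaps.items():
    print(T, round(reduced_flood_exact(T), 4), round(gap, 4))
```

The gap shrinks steadily as $T$ grows, which is the sense in which the linear formula is an approximation for return periods above ten years.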

Another form of (43) is

(44)  $\dfrac{x}{u} = 1 + \dfrac{2.30258}{\alpha u} \log T(x).$

The ratio of the flood discharge which will be exceeded in the mean once in $T$
years to the modal annual flood converges to a linear function of the logarithm
of the return period. The constant $1/\alpha u$ of dimension zero depends, by (33),
on the coefficient of variation. Its value is a characteristic of the stream. If
we introduce the arithmetic mean $\bar{x}$ and the standard deviation $s$ we obtain
by (42), (27), and (28)

      $x = \bar{x} - 0.45005\,s + (0.77970)(2.30258)\,s \log T(x).$

Therefore, approximately,

(45)  $\dfrac{x}{\bar{x}} = 1 - 0.45\,v + 1.796\,v \log T(x).$

The right hand member of this linear equation contains only one constant, the
coefficient of variation of the floods. Finally by (42) and (31)

(46)  $\log T(x) = 0.25068 + 0.55700\,\dfrac{x - \bar{x}}{s}.$

There is still another way of interpreting these asymptotic formulas. Let
$T(2x)$ be the return period of the value $2x$; then by (43)

      $2x = u + \dfrac{\ln T(2x)}{\alpha},$

therefore

      $2 = \dfrac{\alpha u + \ln T(2x)}{\alpha u + \ln T(x)},$

and finally

(47)  $T(2x) = T^2(x)\,e^{\alpha u}.$

The return period of a flood of magnitude $2x$ is equal to the square of the
return period of $x$ multiplied by a factor which depends only upon the coefficient
of variation.
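
Relation (47) is an identity of the asymptotic formula $T(x) = e^{\alpha(x - u)}$, which a short numerical check makes concrete. The constants below are invented for illustration and are not fitted to any stream.

```python
import math


def asymptotic_return_period(x, alpha, u):
    """Asymptotic form of (41) and (23): T(x) = exp(alpha * (x - u))."""
    return math.exp(alpha * (x - u))


alpha, u = 1 / 550.0, 2200.0  # hypothetical constants
x = 3000.0

lhs = asymptotic_return_period(2 * x, alpha, u)                        # T(2x)
rhs = asymptotic_return_period(x, alpha, u) ** 2 * math.exp(alpha * u)  # (47)
print(lhs, rhs)
```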

All these asymptotic formulas are good approximations only for return periods
above ten years, which means according to Table II $y \geq 2.25$, or according
to (23), (30) and (31) $x \geq \bar{x} + 1.3\,s$. The corresponding value of the flood
probability is by (3) $\mathfrak{W}(x) \geq 0.9$. The consequences of (41) can be applied to
only 10% of the observations, i.e. to the large flood discharges. Their observed
return periods are based on a few observations and may therefore differ considerably
from the theoretical values. In spite of the above restrictions the
linear formula (43) has a meaning for values of $T$ equal to or greater than unity.
We now ask: How will the most probable largest value increase with the number
of observations? This number of years can again be called $T$. The answer to
the above question requires the solution of (13') where the distribution (25) of
largest values $\mathfrak{w}(y)$ must be introduced as the initial distribution $w(x)$.

From (24)

      $(T - 1)\,e^{-y} - 1 + e^{-y} = 0,$

      $T e^{-y} = 1,$

which is identical with (41). For $T = 1$ the most probable annual flood is of
course $u$. Therefore the relation (41), valid for $T \geq 1$, means: The most probable
flood $u(T)$ to be reached within $T$ years is a linear function of the logarithm
of $T$,

(41') $u(T) = u + \dfrac{2.30258 \log T}{\alpha}.$
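
The solution $T e^{-y} = 1$ can be confirmed by locating the maximum of the density of the largest of $T$ annual floods numerically. With $\mathfrak{W}(y) = e^{-e^{-y}}$ and $\mathfrak{w}(y) = e^{-y - e^{-y}}$, the logarithm of $T\,\mathfrak{W}^{T-1}\mathfrak{w}$ reduces to $\ln T - T e^{-y} - y$. The grid search and the function names below are ours.

```python
import math


def log_density_of_largest(y, T):
    """log of T * W(y)^(T-1) * w(y) for the reduced distribution (24), (25)."""
    return math.log(T) - T * math.exp(-y) - y


def numerical_mode(T, low=-5.0, high=15.0, steps=200_000):
    """Grid search for the reduced flood y that maximizes the density."""
    h = (high - low) / steps
    best = max(range(steps + 1), key=lambda i: log_density_of_largest(low + i * h, T))
    return low + best * h


for T in (1, 10, 100):
    print(T, round(numerical_mode(T), 3), round(math.log(T), 3))
```

In reduced units the most probable flood within $T$ years is $\ln T$; multiplying by $1/\alpha$ and adding $u$ gives (41').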


The constant 1/a is the slope of this straight line. The results (41-46) are 
related to Fuller’s well-known formula [6]. This author, the first to investigate 
flood flows systematically, proposed a linear relation between the logarithm of 
the return period and the arithmetic mean of the flood discharges greater than 
the mth value (m taken from above). A similar empirical formula has been
stated by Lane [7] and has been applied by Saville [19]. The similarities and 
differences between these interpolation formulas and our theory can be stated 
in the following way: If we start from the theory of largest values we reach 
these formulas as asymptotic expressions for the return period of large floods. 
Considered this way, our theory gives a certain justification to Fuller’s hypothe- 
sis. But Fuller’s and similar formulas were intended to apply to all flood 
discharges. Now, the distribution of the flood discharges (4) corresponding to 
these return periods does not fit the observations. It can be shown that these 
formulas involve the assumption of a simple exponential distribution g(x) for 
the flood discharges 


(48)    g(x) = \frac{1}{\bar{x} - \varepsilon} e^{-\frac{x - \varepsilon}{\bar{x} - \varepsilon}}

and the existence of a lower limit ε of the flood discharges given by ε = x̄ − s.
In Fuller’s formula all flood discharges must be greater than 2/3 of the mean 
annual flood. The density of probability always diminishes with increasing 
magnitude of the flood. This neglects the ascending branch (about one third) 
of the distribution of floods (see Fig. 3) and is incompatible with the observed 
facts. We therefore prefer our formula which takes account of the total varia- 
tion, but we do not minimize the importance of Fuller’s work which has led to 
much valuable research. 

Formula (39) gives the theoretical return periods T(x) as a function of the
reduced flood discharge y, and holds for the entire range of observations. The
general numerical values are given in Table II, cols. 1 and 3. For a given stream,
the return period of a flood discharge greater than x depends by (23) upon the
two constants a and u. If these values have been calculated by (37) and (38)
the theoretical flood discharge x corresponding to T(x) is obtained by the
linear transformation

(49)    x = u + y/a.

The asymptotic formula (42) suggests the coordination of the flood discharges
x and the logarithm of the return periods.


4. Rhône and Mississippi Rivers. We think that our system of formulas is
simple, logically consistent and free of artificial assumptions. Now it remains
to be shown that the arithmetic involved is simple and that the results fit the
observations. For the Rhône we shall analyze the observed cumulative frequency,
the distribution, and the return periods. For the Mississippi River
we shall limit ourselves to the return periods.

For each year we choose the maximum of the daily discharges (we do not use
momentary peaks). The 111 values x_m for the Rhône 1826-1936 published by
Coutagne [3] and arranged in order of increasing magnitude are given in Table III
(col. 1). The supposition that the intervals between consecutive floods are all
equal to one year is not always true. Only 77 of the 111 floods occurred between
October and March, whereas 34 were scattered throughout the year. But the
differences in the lengths of the intervals compensate each other. The second
column of Table III contains the serial number m. According to (9) we calculate
for the mth observed flood discharge x_m, taken in ascending magnitude,
the logarithm of the observed return period log n/(n − m) (col. 3), where n = 111
and m = 1, 2, ..., 110, and obtain the exceedance intervals. The other
observed curve, the recurrence interval, is obtained by (10) through the
coordination of x_{m+1} and log n/(n − m). Both curves are plotted in Fig. 1. The
recurrence and exceedance intervals differ for the large flood discharges. The
observed flood discharges arranged in increasing magnitude are plotted in the
cumulative histogram, Fig. 2.


TABLE IV
Calculation of constants

Stream                            Rhône             Mississippi River
Observation station               Lyon (France)     Vicksburg (Miss.)
                                  1826-1936         1890-1939
Number of observations            111               50
Annual mean flood                 2,493.5           1,355.6
Mean squared flood                6,707,555.0       1,951,828.8
Standard deviation                703.1             341.3
Constant 1/a                                        266.1
Most probable annual flood u

TABLE V
Observed and theoretical distributions of flood discharges, Rhône

Reduced    Variable   Midpoints   Observed       Theoretical    Cumulative
variable   x                      distribution   distribution   frequency
y                                 111Δ'𝔚(x)      111Δ𝔚(x)       111𝔚(x)

−2.75
−2.50                   807                                        0.00
−2.25                                                              0.01
−2.00                  1081                                        0.07
−1.75                                                              0.35
−1.50                  1355                                        1.26
−1.25                                                              3.38
−1.00                  1629                                        7.33
−0.75                                                             13.36
−0.50                  1903                                       21.35
−0.25                                                             30.74
 0.00                  2177                                       40.84
 0.25                                                             50.95
 0.50                  2451                                       60.52
 0.75                                                             69.21
 1.00                  2725                                       76.83
 1.25                                                             83.35
 1.50                  2999                                       88.80
 1.75                                                             93.29
 2.00                  3273                                       96.95
 2.25                                                             99.90
 2.50                  3547                                      102.25
 2.75                                                            104.13
 3.00                  3822                                      105.70
 3.25                                                            106.78
 3.50                  4096                                      107.70
 3.75                                                            108.42
 4.00                                                            108.98
 4.25                                                            109.43
 4.50                                                            109.77
 4.75                                                            110.04
 5.00                                                            110.25
 5.25                                                            110.42
 5.50                                                            110.55
 5.75                                                            110.65
 6.00                                                            110.73


To compare these observations with our theory, we calculate the two constants
1/a and u according to the formulas (34)-(38). The values Σx_m and
Σx_m² are given at the end of Table III. Division by n = 111 gives the mean
flood x̄ and the mean squared flood \overline{x^2} (Table IV). The Gaussian correction
being 1 + 1/110 we obtain from formula (36) the standard deviation s (Table IV)
and finally from (37) and (38) the constant 1/a and the most probable annual
flood u. From the numerical values in Table IV the linear transformation (49)
for the Rhône is

    x = 2177.03 + 548.19y.
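
The arithmetic leading to these constants can be retraced from the rounded entries of Table IV. The sketch below is only a check under stated assumptions: it takes the standard deviation as s = sqrt((1 + 1/(n − 1)) (mean square − mean²)), the constant 1/a = 0.77970 s, and u = x̄ − 0.57722/a; the function name is hypothetical. Because the inputs are rounded, the results agree with the printed constants only to within rounding.

```python
import math

def flood_constants(mean, mean_sq, n):
    """Return (s, 1/a, u) from the mean flood, the mean squared flood and n."""
    s = math.sqrt((1.0 + 1.0 / (n - 1)) * (mean_sq - mean ** 2))
    inv_a = 0.77970 * s          # constant 1/a
    u = mean - 0.57722 * inv_a   # most probable annual flood
    return s, inv_a, u

# Rounded entries of Table IV.
rhone = flood_constants(2493.5, 6707555.0, 111)
mississippi = flood_constants(1355.6, 1951828.8, 50)

for name, (s, inv_a, u) in (("Rhone", rhone), ("Mississippi", mississippi)):
    print(f"{name}: s = {s:.1f}, 1/a = {inv_a:.2f}, u = {u:.2f}")
```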


TABLE VI
Observed return periods
Mississippi River, Vicksburg (Miss.) (1890-1939)

Flood       Serial    Return period    ||  Flood       Serial    Return period
discharge   number    log 'T(x_m)      ||  discharge   number    log 'T(x_m)
x_m         m                          ||  x_m         m

  760        1                         ||  1357        26         .3188
  866        2                         ||  1457        27
  870        3                         ||  1397        28         .3566
  912        4                         ||  1397        29         .3768
  923        5                         ||  1402        30         .3980
  945        6                         ||  1406        31         .4202
  990        7                         ||  1410        32         .4437
  994        8                         ||  1410        33         .4686
 1018        9                         ||  1426        34         .4949
 1021       10                         ||  1453        35         .5229
 1043       11         .1079           ||  1475        36         .5529
 1057       12                         ||  1480        37         .5851
 1060       13                         ||  1516        38         .6198
 1073       14                         ||  1516        39         .6576
 1185       15         .1549           ||  1536        40         .6990
 1190       16                         ||  1578        41         .7448
 1194       17                         ||  1681        42         .7959
 1212       18         .1939           ||  1721        43         .8539
 1230       19                         ||  1813        44         .9208
 1260       20         .2219           ||  1822        45        1.0000
 1285       21                         ||  1893        46        1.0969
 1305       22                         ||  1893        47        1.2219
 1332       23                         ||  2040        48        1.3980
 1342       24                         ||  2056        49
 1353       25                         ||  2334        50

Σx_m = 67,780.        Σx_m² = 97,591,440.


This leads to the determination of the theoretical flood discharges. The
theoretical return periods log T(x) are given in Table II, col. 3 as a function of the
reduced variable y and of x (col. 4). The discharges x obtained by letting
y take on the values −2.75 to 6.00 in the linear transformation, are given in
Table V, cols. 2 and 3 and plotted in Fig. 1. The distances Δx used in the
calculations of the theoretical discharges are 1/4a = 137.05.

Along the abscissa are plotted the logarithm of the return periods and the
return periods in years; along the ordinate are plotted the corresponding flood
discharges and the modal annual flood u. The straight line from the point (u, 0)
to the asymptote gives the most probable flood as a function of time. The
theoretical curve corresponds quite closely with the general course of the
observations. For small floods the theoretical return periods are practically
identical with the observed values. But for the very large floods the theoretical
curve surpasses both the exceedance and recurrence intervals.


Fig. 1. Rhône at Lyon (France) 1826-1936
Observations Table III: Recurrence intervals, + - - +; Exceedance intervals,
●-●; Return periods, ; Theory Table II, cols. 3 and 4: Extrapolation, - -.

The observed cumulative histogram is shown in Fig. 2. We calculate from
Table II, col. 2, the frequencies 111𝔚(x) (Table V, col. 6). These theoretical
values (x, 111𝔚(x)) are also plotted in Fig. 2. The agreement between theory
and observations is very good.

For the comparison of the observed and theoretical distributions of the flood
discharges we use what might be called the natural classification. For the
observations, the length of the class intervals and the beginning of the first class
interval are arbitrary. In order to obtain the observed distribution of the flood
discharges, it is natural to use the theoretical class intervals set forth in Table V,
col. 2. The data of the third column can be interpreted as the midpoints of the
class intervals given in col. 2. The frequencies for these class intervals are
obtained from Table III, and are given in Table V, col. 4. The observed distribution
is shown in Fig. 3. To obtain the corresponding theoretical distribution we
calculate from Table V, col. 6, the difference between two cumulative frequencies
disjoined by one, i.e., we pair consecutively the first and third, the second and
fourth items and so on. This theoretical distribution given in col. 5 and the
observed distribution are based on class intervals of the same length. Fig. 3
shows that the theoretical distribution Δ𝔚(x) of the largest values agrees in a
satisfactory way with the observed distribution Δ'𝔚(x) of the flood discharges.


Fig. 2. Cumulative frequency of the flood discharges. Rhône, Lyon (France)
1826-1936
Observations Table III cols. 1 and 2, ●-●; Theory Table V cols. 2, 3 and 6.
Table VI, col. 1, gives the corrected² flood discharges x_m, measured in units of
1000 cubic feet per second, for the Mississippi River at Vicksburg (1890-1939),
(n = 50), arranged according to increasing magnitude; col. 2 gives the serial
number m. We calculate the logarithm of the observed return periods log
n/(n − m), (col. 3). The observations (x_m, log 'T(x_m)) and (x_{m+1}, log 'T(x_m))
are plotted in Fig. 4. The constants obtained by formulas (34)-(38) are shown
in Table IV. By (49) the theoretical floods x corresponding to the return
periods T(x) presented in Table II, col. 3, are

    x = 1201.98 + 266.14y.

These floods are given in Table II, col. 5. The class interval used is

    1/4a = 66.5.

The theoretical curve (x, log T(x)), plotted in Fig. 4, agrees in a very satisfactory
way with the observations. For the large floods the theoretical return periods
are between the exceedance and recurrence intervals.


Fig. 3. Distribution of the flood discharges. Rhône, Lyon (France) 1826-1936
Observations Table V cols. 2, 3 and 4; Theory Table V cols. 2, 3 and 5.


² These data have been put at my disposal through the courtesy of Mr. A. E. Brandt of
the U. S. Department of Agriculture.

The calculations of the theoretical return periods for other streams, e.g. the 
Columbia, Connecticut, Cumberland, Rhine, and Tennessee Rivers, for which 
reliable observations exist for more than 60 years, also show a good agreement 
with the observations. The goodness of fit diminishes for streams for which 
the number of observations is smaller and for which the data are not very 
reliable. 


Fig. 4. Mississippi River at Vicksburg (Miss.) 1890-1939
Observations Table VI: Recurrence intervals, + - - +; Exceedance intervals,
●-●; Return periods, -; Theory Table II, cols. 3 and 5; Extrapolation, - -.


5. Summary and conclusions. In order to apply any theory we have to sup- 
pose that the data are homogeneous, i.e. that no systematical change of climate 
and no important change in the basin have occurred within the observation 
period and that no such changes will take place in the period for which extra- 
polations are made. It is only under these obvious conditions that forecasts 
can be made. 

The theoretical return period T(x), the mean number of years between two
annual flood discharges greater than or equal to x, is a statistical function such
as the distribution w(x) or the probabilities W(x) and P(x). There are two





188 E. J. GUMBEL 


sets of observed values corresponding to the theoretical set: the exceedance
interval 'T(x_m), formula (9), and the recurrence interval T(x_m), formula (9'),
x_m being the mth flood discharge, where m is counted from below. As any
theory must include both notions, no separate theory for exceedance or recurrence
intervals is possible.

The return period T(x) of a flood discharge x is found by formula (39). For
large values of x the flood discharge converges toward a linear function (42) of
the logarithm of the return period. This is the scientific basis of Fuller's empirical
formula. The two constants of our formula, u and 1/a, are, respectively,
the most probable annual flood discharge and a multiple of the standard deviation
(28). Their values depend upon the drainage basin and known geological
and meteorological factors. It is beyond our present task to consider the influence
of these factors. Our method can be summarized by the following rules:

1) For each year find the maximum daily discharge x_m (do not use momentary
peaks) and arrange these n data in increasing magnitudes.

2) Calculate for each discharge x_m (m = 1, 2, ..., n − 1), the values log
'T(x_m) = log n − log (n − m) and plot the curves x_m, log n/(n − m), and
x_{m+1}, log n/(n − m). These are the observed exceedance and recurrence
intervals.

3) Calculate the annual mean flood x̄ and the annual mean squared flood \overline{x^2};
determine according to (36)-(38) the standard deviation

    s = \sqrt{\left(1 + \frac{1}{n-1}\right)\left(\overline{x^2} - \bar{x}^2\right)}

and the two constants

    1/a = 0.77970 s,

    u = \bar{x} - \frac{0.57722}{a}.

4) The theoretical flood discharges x corresponding to the logarithm of the
return period T(x) given in Table II, col. 3, are obtained by the linear
transformation

    x = u + y/a

where y is taken from Table II, col. 1. Plot x as a function of log T(x). For
large values of x and for extrapolation it is sufficient to use the linear asymptote
obtained graphically.
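
The four rules can be sketched in a few lines of code. This is an illustration under assumptions, not a substitute for Table II: the return period is taken as T = 1/(1 − exp(−e^{−y})), which is assumed here to be the content of formula (39); the standard deviation uses the correction 1 + 1/(n − 1); all function names are hypothetical, and the sample of annual maxima is made up.

```python
import math

def observed_curves(discharges):
    """Rules 1 and 2: observed exceedance and recurrence curves."""
    x = sorted(discharges)
    n = len(x)
    log_t = [math.log10(n / (n - m)) for m in range(1, n)]
    exceedance = list(zip(x[:-1], log_t))  # points (x_m, log n/(n - m))
    recurrence = list(zip(x[1:], log_t))   # points (x_{m+1}, log n/(n - m))
    return exceedance, recurrence

def fit_constants(discharges):
    """Rule 3: return the two constants (1/a, u)."""
    n = len(discharges)
    mean = sum(discharges) / n
    mean_sq = sum(v * v for v in discharges) / n
    s = math.sqrt((1.0 + 1.0 / (n - 1)) * (mean_sq - mean ** 2))
    inv_a = 0.77970 * s
    u = mean - 0.57722 * inv_a
    return inv_a, u

def return_period(y):
    """Assumed form of (39): T = 1 / (1 - exp(-exp(-y)))."""
    return -1.0 / math.expm1(-math.exp(-y))

def theoretical_curve(inv_a, u, ys):
    """Rule 4: points (x, log10 T) with x = u + y/a."""
    return [(u + y * inv_a, math.log10(return_period(y))) for y in ys]

# Made-up annual maxima, for illustration only.
sample = [820, 1010, 1150, 1230, 1340, 1420, 1560, 1710, 1950, 2400]
exceedance, recurrence = observed_curves(sample)
inv_a, u = fit_constants(sample)
ys = [-2.0 + 0.25 * k for k in range(33)]  # y from -2.00 to 6.00
for x, log_t in theoretical_curve(inv_a, u, ys)[::8]:
    print(f"x = {x:8.1f}   log T = {log_t:.4f}")
```

With this form of T the value y = 2.25 gives a return period of very nearly ten years, and for large y the return period approaches e^y, so the theoretical curve becomes a straight line in log T.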

The linear part of the theoretical curve (x, log T) permits of two interpretations:
First, T is the theoretical return period of a flood greater than or equal
to x; second, x is the most probable flood to be reached within T years. The
second interpretation holds for the straight line through the point (u, 0).

The figures show a close agreement between observed and theoretical values.
The observed curvature of the return periods is brought out by the theoretical
graph. The agreement between theory and observation is excellent for floods which
correspond to reduced values of y ≤ 3. For the two or three extreme floods,
the return periods are based on a few observations and, consequently, the
agreement is not very good. No theory can be verified by two or three observations.
Generally speaking, the theory fits the observations as closely as could be
expected for such a complicated phenomenon.

In order to make a further test of our results, we need a numerical measure 
for the weights to be given to the theoretical points. Therefore, for a given 
probability we must find the corresponding theoretical limits for the observed 
return periods. The theory of positional values will give these control curves. 
Since it was the purpose of this article to develop and make clear the basic 
method, we have refrained from introducing this subject. 

It is our claim that the calculus of probabilities and especially the theory of 
largest values, is an efficient tool for the solution of certain hydrological problems.



ON THE FOUNDATIONS OF PROBABILITY AND STATISTICS¹


By R. von Mises


Harvard University

1. Introduction. The theory of probability and statistics which I have been 
upholding for more than twenty years originates in the conception that the only 
aim of such a theory is to give a description of certain observable phenomena, 
the so called mass phenomena and repetitive events, like games of chance or 
some specified attributes occurring in a large population. Describing means 
here, in the first place, to find out the relations which exist between sequences 
of events connected in some way, e.g. a sequence of single games and the sequence 
composed of sets of those games or between a sequence of direct observations 
and the so called inverse probability within the same field of observations. The 
theory is a mathematical one, like the mathematical theory of electricity, based 
on experience, but operating by means of mathematical processes, particularly 
the methods of analysis of real variables and theory of sets. 

We all know very well that in colloquial language the term probability or 
probable is very often used in cases which have nothing to do with mass phe- 
nomena or repetitive events. But I decline positively to apply the mathemati- 
cal theory to questions like this: What is the probability that Napoleon was a 
historical person rather than a solar myth? This question deals with an iso- 
lated fact which in no way can be considered as an element in a sequence of 
uniform repeated observations. We are all familiar with the fact that, e.g. the 
word energy is often used in every day language in a sense which does not 
conform to the notion of energy as adopted in mathematical physics. This 
does not impair the value of the precise definition of energy used in physics and 
on the other hand this definition is not intended to cover the entire field of daily 
application of the term energy. 

We discard likewise the scholastic point of view displayed in a sentence of this 
kind: “‘. .. that both in its meaning and in the laws which it obeys, probability 
derives directly from intuition and is prior to objective experience.” This 
sentence is quoted from a mathematical paper printed in a mathematical journal 
of 1940. The same author continues calling probability a metaphysical problem 
and speaking of the difficulties ‘which must in the nature of things always be 
encountered when an attempt is made to give a mathematical or physical solu- 
tion to a metaphysical problem.” In my opinion the calculus of probability 
has nothing to do with metaphysics, at any rate not more than geometry or 
mechanics has. 































¹ Address delivered on September 11, 1940 at a meeting of the Institute of Mathematical
Statistics in Hanover, N. H.

On the other hand we claim that our theory, which serves to describe observable
facts, satisfies all reasonable requirements of logical consistency and is
free from contradictions and obscurities of any kind. I am now going to outline
the essential ideas of the theory as developed by me since 1919 and I shall have
to refer as to the proof of its consistency to the recent work of A. H. Copeland,
of J. Herzberg and of A. Wald. Then I will give some examples of application
in order to show how the theory works and how it applies to actual problems in
statistics.


2. The notion of kollektiv. The basic notion upon which the theory is established
is the concept of kollektiv. We consider an infinite sequence of experiments
or observations every one of which supplies a definite result in the form
of a number (or a group of numbers in the case of a kollektiv of more than one
dimension). We shall designate briefly by X the sequence of results x_1, x_2,
x_3, .... In tossing a die we get for X an endless repetition of the integers one
to six, x = 1, 2, ..., 6. If we are interested in death probability, we observe a
large group of healthy 40 year old men and mark a one for each individual surviving
his 41st anniversary and a zero for each man who dies before, so that the
sequence x_1, x_2, x_3, ... consists of zeros and ones. In a certain sense the
kollektiv corresponds to what is called a population in practical statistics.
Experience shows that in such sequences the relative frequency of the different
results (one to six in the first of our examples, one and zero in the second) varies
only slightly, if the number of experiments is large enough. We are therefore
prompted to assume that in the kollektiv, i.e. in the theoretical model of the
empirical sequences or populations, each frequency has a limiting value, if the
number of elements increases endlessly. This limiting value of frequency is
called, under certain conditions which I shall explain later, the "probability of
the attribute in question within the kollektiv involved." The set of all limiting
frequencies within one kollektiv is called its distribution.

Let me insist on the fact that in no case is a probability value attached to a
single event by itself, but only to an event inasmuch as it is the element of a well
defined sequence. It happens often that one and the same fact can be considered 
as an element of different kollektivs. It may then be that different probability 
values can be ascribed to the same event. I shall give a striking example of this, 
which we encounter in the field of actual statistical problems, at the end of this 
lecture. 

The objection has been made: Since all empirical sequences are obviously 
finite sequences, why then assume infinite kollektivs? Our answer is that any 
straight line we encounter in reality has finite length, but geometry is based on 
the notion of infinite straight lines and uses e.g. the notion of parallels which 
has no sense, if we restrict ourselves to segments of finite lengths. Another 
objection, often repeated, reads that there is a contradiction between the exist- 
ence of a frequency limit and the so called Bernoulli theorem which states that 
sequences of any length showing a frequency say } can also occur in cases for
which the probability equals 4. But it has been proved, in a rigorous way
excluding any doubt, that the two statements are compatible, even by explicit
construction of infinite sequences fulfilling both conditions. I would even claim
that the real meaning of the Bernoulli theorem is inaccessible to any probability 
theory that does not start with the frequency definition of probability. 

Now we are in the position to explain how our probability theory works. 
This sequence of zeros and ones 


(X)    101|001|100|011|110|011|010|111 ...


may represent the outcomes of a game of chance. The ones show gains, the 
zeros losses for one of the two players. If we separate the terms of X into groups 
of three digits and replace each group by a single one or zero according to the 
majority of terms within the group, we get a new sequence 


(X’) 10011101... 


which represents the gains and losses in sets of three games. Our task is now 
to compute the distribution, i.e. the limiting frequencies of zeros and ones in 
this new sequence X’, assuming the two frequencies in X are known. A sequence 
can formally be considered as a unique number like a decimal fraction with an 
infinite number of digits. Then the transition from X to X’ can be called a 
transformation of a number X’ = T(X). As our sequences have to fulfill certain 
conditions Copeland calls the sequences X, X’ admissible numbers. What I 
just quoted was of course a very special example of a transformation of a number. 
But we have to emphasize that all problems dealt with in probability theory, 
without any exception, have this unique form: The distribution or the limiting 
frequencies in certain sequences are given, other sequences are derived from the 
given ones by certain operations, and the distributions in these derived sequences 
have to be computed. In other words: Probability theory is the study of trans- 
formations of admissible numbers, particularly the study of the change of distribu- 
tions implied by such transformations. 
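
The passage from X to X' described above is easy to reproduce. The following sketch applies the majority rule to the first twenty-four digits of X quoted in the text; the function name is hypothetical, and the code illustrates only this one transformation, not the general theory.

```python
def majority_of_three(bits):
    """Map a 0/1 sequence X to X' by majority vote over consecutive triples."""
    usable = len(bits) - len(bits) % 3  # ignore an incomplete final group
    return [1 if sum(bits[i:i + 3]) >= 2 else 0 for i in range(0, usable, 3)]

# The digits of X as printed in the text, without the group separators.
x = [int(c) for c in "101001100011110011010111"]
x_prime = majority_of_three(x)
print("".join(str(b) for b in x_prime))  # 10011101
```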

We know four and only four simple, i.e. irreducible transformations or four 
fundamental operations. They are called selection, mixing, partitioning and 
combination. By combining these basic processes we can settle all problems 
in probability theory. The formal, mathematical difficulties in carrying out the
computation of the new distributions may become very serious in certain cases,
particularly if we have to apply an infinite number of transformations (asymptotic
problems). But, in the clearly defined framework of this theory no space
is left for any metaphysical speculations, for ideas about sufficient reason or
insufficient reason, for notions like degree of evidence or for a special kind of
probability logic and so on. And further no modification is needed for handling usual
statistical problems: Terms like inverse probability, likelihood, confidence
degrees, etc. are justified and admitted only as far as they are capable of being
reduced to the basic notion of kollektiv and distribution within a kollektiv. I
will give some more details to this point later. Meanwhile let me turn to a
general question which, in a certain way, is the crucial point in establishing the
new probability theory.


3. Place selections and randomness. It is obvious that we have to restrict
still further the notion of kollektiv or the field of sequences which can be
considered as the objects of a probability investigation. The successive outcomes
of a game of chance differ very clearly from any regular sequence as defined by a
simple arithmetical law, e.g. the regularly alternating sequence 0 1 0 1
0 1 0 1 .... A typical property which singles out the irregular or random
sequences and which has to be reproduced in every probability theory is that, if
p is the probability of encountering a one in the sequence, then p² is the
probability of two ones following each other immediately. Any probability theory has
to introduce an axiom which enables us to deduce this theorem and others of a
similar type. The question is only how to find a sufficiently general and
consistent form for it. The procedure I have chosen consists in using a special kind
of transformation of a sequence, which I call a place selection.

A place selection is defined by an infinite set of functions s_n(x_1, x_2, ..., x_{n−1})
where x_1, x_2, x_3, ... are the digits of an admissible number or a kollektiv and
s_n has one of the two values zero or one. Here s_n = 1 means that the nth digit
of the sequence is retained, s_n = 0 means that it is discarded. The decision
about retaining or discarding the nth element depends, as you see, only on the
preceding values x_1, x_2, ..., x_{n−1}, but not on x_n or the following digits. Example
of a place selection:

    s_n = 1,  if x_{n−1} = 0 for prime numbers n,
              if x_{n−1} = 1 for n not prime,
    and s_n = 0 in all other cases.
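
A place selection of this kind can be tried on a simulated game. The sketch below assumes that the example depends on the immediately preceding digit x_{n−1}, and it discards the first digit because nothing precedes it; the function names are hypothetical and the pseudo-random digits merely stand in for the outcomes of a game of chance.

```python
import random

def is_prime(n):
    """Trial-division primality test, sufficient for a short demonstration."""
    if n < 2:
        return False
    k = 2
    while k * k <= n:
        if n % k == 0:
            return False
        k += 1
    return True

def s_n(n, preceding):
    """Decide about digit n from x_1, ..., x_{n-1} only (assumed reading of the example)."""
    if not preceding:
        return 0  # no preceding digit: falls under "all other cases"
    last = preceding[-1]
    if is_prime(n):
        return 1 if last == 0 else 0
    return 1 if last == 1 else 0

def apply_place_selection(digits):
    """Return the selected subsequence S(X)."""
    seen, selected = [], []
    for n, digit in enumerate(digits, start=1):
        if s_n(n, seen) == 1:  # the decision is made before digit n is looked at
            selected.append(digit)
        seen.append(digit)
    return selected

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(20000)]
selected = apply_place_selection(x)
print(sum(x) / len(x), sum(selected) / len(selected))
```

The two printed frequencies should lie close together, since a place selection looks only at earlier digits.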


Experience shows that, if we apply such a place selection to the sequence X 
of outcomes of a game of chance, we get a new, selected sequence S(X) in which 
the frequencies of gains and losses are about the same as in X. This fact or 
the practical impossibility of a gambling system suggests the adoption of the 
following procedure in handling transformations of admissible numbers. 

First, if within a certain investigation the transformation applied to X is a
place selection, we assume that the distribution in X' = S(X) is the same as
in X: distr S(X) = distr X. Second, if a general transformation T is applied
to X, say X' = T(X), then we examine whether the existence of a place selection
S that changes the distribution in X' (so as to have distr S(X') ≠ distr X')
implies the existence of a place selection S_1 that would affect the distribution in
X (so as to give distr S_1(X) ≠ distr X). If this is the case, we say that X' is
a kollektiv, provided that the original sequence X was considered to be a kollektiv.
Take e.g. for X the sequence resulting from tossing a die endlessly, and
call p_1, p_2, ..., p_6 the limiting frequencies of the six possible outcomes 1, 2, ..., 6.
The transformation T may consist in replacing every 1 in the sequence X by a
2, every 3 by a 4, and every 5 by a 6. The new sequence consists of only three
different kinds of elements 2, 4, 6 and therefore its distribution includes only
three values p'_2, p'_4, p'_6 where evidently p'_2 = p_1 + p_2 etc. Here it is almost
obvious that if a place selection applied to X' changes the value of p'_2, the same
selection if applied to X must change either p_1 or p_2. So, if the original sequence
X was considered as a kollektiv, X' has to be admitted too.

Now the question arises whether this procedure is in itself consistent or
whether it can lead to contradictions. We were concerned up to now with
kollektivs the elements of which belong to a finite set of distinct numbers
ξ_1, ξ_2, ..., ξ_k and the distributions of which are therefore defined by k
non-negative values p_1, p_2, ..., p_k with the sum 1. In this case it was pointed out
by Wald and by Copeland that, if an arbitrary distribution and an arbitrary
countable set Σ of place selections are given, there exists a continuum of
sequences every one of which has the given distribution, which is not affected by
any place selection belonging to Σ. Now it may be supposed that in a concrete
problem a sequence X' is derived from a sequence X by a finite number of
fundamental operations involving a finite set Σ' of place selections. Another
finite set Σ'' may consist of selections employed in establishing that certain
sequences used in the derivation of X' are "combinable" ones. Finally an
arbitrary countable set Σ of selections S may be assumed. According to our
procedure we have shown that to any place selection S which affects the distribution
in X' corresponds a certain S_1 which, when applied to X, changes the
distribution of X. All these S_1 corresponding to the elements S of Σ form a
countable set Σ_1. Now the set Σ_2 including Σ', Σ'', Σ_1 and also including all
products of two of its own elements is a countable set too. What we use in
computing the distribution of X' is only the fact that the given sequence X is
unaffected by the selections that are elements of Σ_2. It follows from the above
quoted results that we can substitute for X a numerically specified sequence
and carry out all operations upon this specified sequence. So it is proved that
no contradiction can arise in computing the final probability according to our
conception.

I cannot enter here into a discussion of the more complicated case where the 
range within which the elements of a kollektiv vary, is an infinite one, either a 
countable set or a continuum. All principal problems connected with estab- 
lishing the notion of kollektiv can be settled satisfactorily, at any rate, by con- 
sidering those general forms of sequences as limiting cases of kollektivs with a 
finite set of attributes. 






4. Example: Set-of-games problem. I want to present now a simple but
instructive example to show how the theory works and what task a mathematical
foundation of the calculus of probability has to achieve. Let us recall the two
sequences $X$ and $X'$ composed of zeros and ones of which we spoke above. The
first represented the outcomes of a sequence of single games, the second the
outcomes of triple sets of those games. If $X$ is considered as a kollektiv with
given probabilities $p$ and $q$ for one and zero, it is easy to deduce the corresponding
values $p'$ and $q'$ for $X'$ and to show that $X'$ is a kollektiv too. We begin by
carrying out three selections which single out from the original sequence $x_1$,
$x_2$, $x_3$, ... first, the elements $x_1$, $x_4$, $x_7$, ...; second, the elements $x_2$, $x_5$, $x_8$, ...;
and third, the elements $x_3$, $x_6$, $x_9$, .... It can be shown by means of certain
further place selections that these three kollektivs, which we call $X_1$, $X_2$, $X_3$,
are combinable. That means that combining the corresponding elements of
the three sequences like $x_1 x_2 x_3$, $x_4 x_5 x_6$, $x_7 x_8 x_9$, ... leads to a new three-dimensional
kollektiv $X_0$ in which each permutation of three digits 0 and 1 has a
probability equal to the corresponding product of $p$- and $q$-factors. For instance
the probability of encountering the group 111 is $p^3$ and for the group 110
it is $p^2 q$. Now we operate a mixing upon $X_0$ by collecting all permutations
with two or three ones. We find in a well known way the sum $p^3 + 3p^2 q$ for
the probability $p'$ of ones in the sequence $X'$. So far the result is very well
known and can be reached (in my opinion, in a very incomplete and unsatisfactory
way) also by the classical methods.
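
The mixing step can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function name `prob_set_won` is hypothetical and not part of the text. It enumerates all eight triples of zeros and ones, weights each by the corresponding product of p- and q-factors, and collects the triples with two or three ones.

```python
from itertools import product


def prob_set_won(p: float) -> float:
    """Probability p' of a one in X', found by enumerating all 0/1 triples.

    Each triple is weighted by the product of p-factors (for ones) and
    q-factors (for zeros); triples with two or three ones are collected.
    """
    q = 1.0 - p
    total = 0.0
    for triple in product((0, 1), repeat=3):
        ones = sum(triple)
        weight = p ** ones * q ** (3 - ones)
        if ones >= 2:  # the set counts as won with two or three ones
            total += weight
    return total


if __name__ == "__main__":
    p = 0.6
    q = 1.0 - p
    # Compare the enumeration with the closed form p^3 + 3 p^2 q.
    print(prob_set_won(p), p ** 3 + 3 * p ** 2 * q)
```

The enumeration and the closed form agree up to rounding; the sketch says nothing about whether $X'$ is a kollektiv, which is the part that needs the place selections described above.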

But what I want to discuss here is a slightly modified question. If the 
sequence X means gains and losses for single games and if the arrangement for 
sets of three games is made as indicated before, then in a real play the gains 
and losses of sets are counted in a different way. For, if the first two games of 
a set are both won or lost by the same player, the fate of the set is decided and 
there is no sense in playing the third game. So the loss of the second set in our
example will already be recognized after the fifth game and the actual sixth
game will be considered as the first game of the third set. In this way the
original sequence $X$ decomposed into groups of two or three games

($X$)    1 0 1 | 0 0 | 1 1 | 0 0 | 0 1 1 | 1 1 | 0 0 | 1 1 | 0 1 0 | 1 1 | ...

leads to a new sequence $X''$

($X''$)    1 0 1 0 1 1 0 1 0 1 ...

which is obviously different from $X'$. Everyone familiar with the usual handling
of the probability concept will say that in $X''$ the probabilities of zeros and
ones must be the same as in $X'$. But a mathematical foundation of the theory of
probability, if it deserves this name, has to clear up the question: From what 
principles or particular assumptions and by what inferences may we deduce the 
equality of the limiting frequencies in X’ and X’’? 
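
The two ways of counting sets can be written out explicitly. The Python sketch below is an illustration, not the author's construction: the function names `triple_sets` and `early_stop_sets` are hypothetical, and a pseudo-random 0/1 sequence is used merely as a stand-in for a kollektiv. It builds $X'$ (every set has three games) and $X''$ (a set ends as soon as it is decided) from the same sequence $X$.

```python
import random


def triple_sets(x):
    """X': outcome of each full set of three games (majority of the three)."""
    return [1 if sum(x[i:i + 3]) >= 2 else 0 for i in range(0, len(x) - 2, 3)]


def early_stop_sets(x):
    """X'': a set ends as soon as it is decided (after two or three games)."""
    out, i = [], 0
    while i + 1 < len(x):
        if x[i] == x[i + 1]:      # first two games agree: set decided
            out.append(x[i])
            i += 2
        elif i + 2 < len(x):      # one game each: the third game decides
            out.append(x[i + 2])
            i += 3
        else:                     # incomplete last set: drop it
            break
    return out


if __name__ == "__main__":
    # The games of the example in the text, read off its grouping.
    x = [int(c) for c in "10100110001111001101011"]
    print(early_stop_sets(x))     # the sequence X'' of the text
    print(triple_sets(x))         # the sequence X' for the same games

    # Pseudo-random stand-in for a kollektiv with p = 0.6 (an assumption).
    rng = random.Random(0)
    games = [1 if rng.random() < 0.6 else 0 for _ in range(300_000)]
    for sets in (triple_sets(games), early_stop_sets(games)):
        print(sum(sets) / len(sets))
```

In such a simulation both frequencies come out close to $p^3 + 3p^2 q$, but this only illustrates the claim; deducing the equality of the limiting frequencies is exactly the mathematical task posed in the text.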

There is no difficulty in solving this problem from the point of view of the
frequency theory. We have only to apply somewhat different place selections
instead of those used above, which led to the kollektivs $X_1$, $X_2$, $X_3$. I showed
elsewhere how the general set-of-games problem can be satisfactorily treated in
this way. Here I want to stress only that the problem as a whole is completely
inaccessible by any of the other known approaches to probability theory. The
classical point of view which starts with the notion of equally likely cases and
rests upon a rather vague idea of the relationship between probability and
sequences of events does not even allow the formulation of the problem. In
the so called modernized classical theory, as proposed by Fréchet, probabilities
are defined as “physical magnitudes of which frequencies are measures.”
Fréchet would say that the frequencies both in $X'$ and in $X''$ are measures of
the same quantity. But why? We face here obviously a mathematical question
which cannot be settled by referring to physical facts. It is clear that the
equality of the distributions in the two sequences $X'$ and $X''$ is due to the
randomness or irregularity of the original sequence $X$. No theory which does
not take into account the randomness, which avoids referring to this essential
property of the sequences dealt with in probability problems, can contribute
anything toward the solution of our question.

I have to make some special remarks about the so-called measure theory of
probability.²

























5. Probability as measure. Up to now we have been concerned only with 
the simplest type of kollektivs, namely, with those sequences the elements of 
which belong to a finite set of numbers so as to have a distribution consisting 
of a finite number of finite probabilities with the sum 1. It may be true that 
all practical problems, in a certain sense, fall into this range. For, the single 
result of an observation is always an integer, the number of smallest units 
accessible to the actual method of measuring. Nevertheless in many cases it 
is much more useful to adopt the point of view that the possible outcomes of an 
experiment belong to a more general set of numbers, e.g. to a continuous segment 
or any infinite variety. If we include the case of kollektivs of more than one 
dimension, we have to consider a point set in a k-dimensional space (where 
even k may be infinite) as the label set or attribute set of the kollektiv. In 
order to define the probability in this case we have to choose a subset $A$ of the
label set and to count among the first $n$ elements the number $n_A$ of those elements
the attributes of which fall into $A$. Then the quotient $n_A : n$ is the frequency,
and its limiting value for $n$ infinite will be called the probability of the attribute
falling into $A$ within the given kollektiv.

It was rightly stressed by many authors that in the case of an infinite label set 
some additional restrictions must be introduced. In particular A. Kolmogoroff 
set up a complete system of such restrictions. We cannot ask for the existence
of the limiting frequency in any arbitrary subset $A$. It will be sufficient
to assume that the limit exists for a certain Körper or a certain additive family
of subsets. If it exists for two mutually exclusive subsets $A$ and $B$, the limit
corresponding to A + B will be, by virtue of the original definition, the sum of 
the limits connected with A and B. We can now insert a further axiom involving 
the complete additivity of the limiting values. So we arrive at the statement 





2 What I call measure theory here is essentially that proposed by Kolmogoroff in his
pamphlet of 1933. As to the new theory developed by Doob in his following paper (where
instead of the label space the space of all logically possible sequences is used in establishing
the measures) see my comment on page 215.

that probability is the measure of a set. All axioms of Kolmogoroff can be
accepted within the framework of our theory as a part of it, but in no way as a
substitute for the foregoing definition of probability.

Occasionally the expression probability as measure theory is used in a dif- 
ferent sense. One tries to base the whole theory on the special notion of a set 
of measure zero. One of the basic assumptions in my theory is that in the 
sequence of results we obtain in tossing a so called correct die the frequency, 
say of the point 6, has a certain limiting value which equals 1/6. A different 
conception consists in stating that anything can happen in the long run with a 
correct die, even that an uninterrupted sequence of six’s or an alternating se- 
quence of two’s and four’s or so on may appear. Only all these events which 
do not lead to the limiting frequency 1/6 form, together as a whole, a set of 
events of measure zero. Instead of my assumption: the limiting value is 1/6 
we should have to state: It is almost certain that a limit exists and equals 1/6. 
Nothing can be said against such an alluring assumption from an empirical 
standpoint, since actual experience extends in no case to an infinite range of 
observations. The only question is whether the assumption is compatible with
a complete and consistent theory. I cannot see how this may be achieved. 
Before saying that a set has measure zero we have to introduce a measure system 
which can be done in innumerable ways. If e.g. we denote the outcome six by a 
one and all other outcomes 1 to 5 by zero, we get as the result of the game with 
a die an infinite sequence of zeros and ones. It has been shown by Borel that 
according to a common measure system the set of all 0, 1 sequences which do not 
have the limiting frequency 1/2 has the measure zero. In this way it turns out
to be almost certain that the limiting frequency of the outcome six in the case
of a correct die is 1/2. Other values for the limit can be obtained by a similar
inference. It is a correct but misleading idea that the measure zero is unaffected 
by a regular (continuous) transformation of the assumed measure system, since 
in our field of problems different measures which are not obtained from one 
another by a regular transformation have equal rights. So, saying that a certain 
set has the measure zero makes in our case no more sense than to state that an 
unknown length equals 3 without indicating the employed unit. 

In recapitulating this paragraph I may say: First, the axioms of Kolmogoroff 
are concerned with the distribution function within one kollektiv and are 
supplementary to my theory, not a substitute for it. Second, using the notion of 
measure zero in an absolute way without reference to the arbitrarily assumed 
measure system, leads to essential inconsistencies. 


6. Statistical estimation. Let me now turn to the last point, the application
of probability theory to one of the most widely discussed questions in today's
statistical research: the so-called estimation problem. Many strongly divergent
opinions are facing each other here. I think that the probability theory based 
on the notion of kollektiv is best able to settle the dispute and to clear up the 
difficulties which arose in the controversies of different writers. 

We may, without loss of generality, restrict ourselves to the simplest case
of a single statistical variable $x$ and a single parameter $\vartheta$, where $x$ of course may
be the arithmetical mean of $n$ observed values. Here (and likewise in the case
of more variables and more parameters) we have to distinguish carefully among
four different kollektivs which are simultaneously involved in the problem.
The range within which both $x$ and $\vartheta$ vary will be assumed to be a continuous
interval so that all distributions will be given by probability densities.

The first kollektiv we deal with is a one-dimensional one where the probability
of $x$ falling into the interval $x,\ x + dx$ depends on $x$ and on a parameter $\vartheta$. If

$$p(x \mid \vartheta) \tag{1}$$

denotes the corresponding density and the limits $A$, $B$ within which $x$ possibly
falls depend on $\vartheta$ too, we have

$$\int_{A(\vartheta)}^{B(\vartheta)} p(x \mid \vartheta)\, dx = 1 \quad \text{for each } \vartheta. \tag{1'}$$


In order to fix the ideas we may imagine that the first kollektiv consists in
drawing a number $x$ out of an urn and that $\vartheta$ characterizes the contents of the
urn. Asking for an estimate of $\vartheta$ implies the assumption that different possible
urns are at our reach every one of which can be used for drawing the $x$. The $\vartheta$
values for the different urns fall into a certain interval $C$, $D$. It is usual to suppose
that the urns are picked out at random so as to give another one-dimensional
kollektiv with the independent variable $\vartheta$. Let $p_0(\vartheta)\, d\vartheta$ be the probability of
picking an urn with the characteristic value falling into the interval $\vartheta,\ \vartheta + d\vartheta$.
This density

$$p_0(\vartheta) \tag{2}$$

is often called the prior or a priori probability of $\vartheta$. As the range within which
$\vartheta$ varies is confined by the constants $C$ and $D$, we have obviously

$$\int_C^D p_0(\vartheta)\, d\vartheta = 1. \tag{2'}$$


Now from these two one-dimensional kollektivs with the variables $x$ in the
first, $\vartheta$ in the second, we deduce by combination (multiplication) a two-dimensional
kollektiv with the density function

$$P(\vartheta, x) = p_0(\vartheta) \cdot p(x \mid \vartheta). \tag{3}$$

The individual experiment which forms the element of this third kollektiv consists
of picking at random an urn and drawing afterwards from this urn. Both
$x$ and $\vartheta$ are now independent variables (attributes of the kollektiv) and it is easy
to see that it follows from (1') and (2')

$$\int_C^D \int_{A(\vartheta)}^{B(\vartheta)} P(\vartheta, x)\, dx\, d\vartheta = \int_C^D p_0(\vartheta)\, d\vartheta \int_{A(\vartheta)}^{B(\vartheta)} p(x \mid \vartheta)\, dx = 1. \tag{3'}$$

We will return later to this two-dimensional kollektiv. Let us, first, derive
from it, by applying the operation of partitioning (Teilung), our fourth and last
kollektiv which is one-dimensional again. Partitioning means that we drop
from the sequence of experiments which form the third kollektiv all those for
which the $x$-value falls outside a certain interval $x,\ x + dx$; and that in this
way we consider a partial sequence of experiments with only the one variable $\vartheta$.
The distribution of $\vartheta$-values within this sequence with quasi-constant $x$ is given,
according to the well known rule of division or rule of Bayes (a rule which can
be proved mathematically) by³

$$p_1(\vartheta \mid x) = \frac{P(\vartheta, x)}{\int P(\vartheta, x)\, d\vartheta} = c(x)\, p_0(\vartheta)\, p(x \mid \vartheta). \tag{4}$$

It follows immediately that

$$\int p_1(\vartheta \mid x)\, d\vartheta = 1. \tag{4'}$$

This function $p_1$ of $\vartheta$ depending on the parameter $x$ is generally called the
posterior or a posteriori probability of $\vartheta$.
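
The rule of division (4) can be tried out on a grid of parameter values. The Python sketch below is a discretized illustration only: the function names, the uniform prior, the density $1/\vartheta$ on $0 \le x \le \vartheta$ and the grid of midpoints are all assumptions chosen for the example, and the integral in the denominator is replaced by a Riemann sum.

```python
def posterior_on_grid(prior, likelihood, x, thetas):
    """Discretized form of (4): posterior proportional to prior * likelihood.

    thetas is an equally spaced grid; the integral over the parameter is
    approximated by a Riemann sum with the grid step.
    """
    step = thetas[1] - thetas[0]
    joint = [prior(t) * likelihood(x, t) for t in thetas]   # P(theta, x) as in (3)
    norm = sum(joint) * step                                # integral of P over theta
    return [value / norm for value in joint]


def uniform_prior(theta):
    """Assumed prior density: constant on the interval (0, 1)."""
    return 1.0


def uniform_likelihood(x, theta):
    """Assumed density of x given theta: 1/theta on 0 <= x <= theta."""
    return 1.0 / theta if 0.0 <= x <= theta else 0.0


if __name__ == "__main__":
    m = 1000
    thetas = [(i + 0.5) / m for i in range(m)]   # midpoints of m cells on (0, 1)
    post = posterior_on_grid(uniform_prior, uniform_likelihood, 0.25, thetas)
    step = thetas[1] - thetas[0]
    print(sum(post) * step)                      # discrete analogue of (4')
```

By construction the discrete posterior sums to one, the analogue of (4'), and it vanishes for parameter values smaller than the observed $x$. Changing `uniform_prior` changes the result, which is the difficulty discussed in the next paragraph.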

If $p_1(\vartheta \mid x)$ can be computed according to the formula (4), every question
concerning the “presumable” value of $\vartheta$ as drawn from the outcome $x$ of an
experiment is completely answered. We can find indeed, by integration, the
probability which corresponds to any part of the interval $C$, $D$ of $\vartheta$ and so the
estimation problem is definitely solved. But the trouble is that in most cases of
practical application nothing or almost nothing is known about the prior
probability $p_0(\vartheta)$ which appears as a factor in the expression of $p_1$. Hence arises
the new question: What can we say about the $\vartheta$-values without having any
information about its prior probability? This is the estimation problem as it is generally
conceived today.

The first successful approach to the answering of this question was made by
Gauss. If we do not know $p_1$, we know however, except for a constant factor,
the quotient $p_1/p_0$ of posterior probability to prior probability, which equals
$c\, p(x \mid \vartheta)$. The maximum of this quotient must be greater than one, since the
average values of both $p_0$ and $p_1$ are the same. So the maximum means the
point of the greatest increase produced by the observed experimental value of $x$
upon the probability of $\vartheta$. It seems reasonable to assume the $\vartheta$-value for which
the ratio $p_1/p_0$ reaches its maximum as an estimate for $\vartheta$: It is the value upon
which the greatest emphasis is conferred by the observation. This idea, originally
proposed by Gauss in his theory of errors, has been later developed chiefly
by R. A. Fisher, and is known today as the maximum likelihood method. Calling
the ratio $p_1/p_0$ likelihood seems indeed an adequate nomenclature.


3 For brevity Bayes' rule is employed in the text as in the case of a discontinuous
distribution. The correct procedure in the case of a continuous $x$ would require that we first
use finite intervals and then pass to the limit.

The method of estimation used most frequently today is not the maximum
likelihood method, but the so called confidence interval method, inaugurated
by R. A. Fisher and now successfully extended and applied by J. Neyman. This
method uses the third of the above mentioned kollektivs instead of the fourth,
i.e. the two-dimensional probability $P(\vartheta, x)$. At first sight it seems hopeless
to use this function which includes the unknown prior probability $p_0(\vartheta)$ as a
factor. But it turns out, as Neyman has shown⁴ (and this is the decisive idea
of the confidence interval method), that we can indicate in the $x, \vartheta$-plane special
regions for which the probability $\iint P(\vartheta, x)\, dx\, d\vartheta$ is independent of $p_0(\vartheta)$. In
fact, if we point out for every $\vartheta$ such an interval $x_1, x_2$ as to have

$$\int_{x_1(\vartheta)}^{x_2(\vartheta)} p(x \mid \vartheta)\, dx = \alpha, \qquad 0 < \alpha < 1, \tag{5}$$

it follows immediately from (2') and (5) for the region covered by these intervals

$$\int_C^D \int_{x_1(\vartheta)}^{x_2(\vartheta)} P(\vartheta, x)\, dx\, d\vartheta = \int_C^D p_0(\vartheta)\, d\vartheta \int_{x_1(\vartheta)}^{x_2(\vartheta)} p(x \mid \vartheta)\, dx = \alpha. \tag{6}$$

For given $\alpha$ the intervals can be chosen in different ways. If we choose $x_1 = A$
for $\vartheta = C$ and $x_2 = B$ for $\vartheta = D$, we get a strip or belt, as shown in Fig. 1,
which supplies for every given $x$ a smallest value $\vartheta_1$ and a greatest value $\vartheta_2$.
The definition of our third kollektiv leads to the conclusion: If we predict each
time a certain $x$ is observed that $\vartheta$ lies between the corresponding $\vartheta_1$ and $\vartheta_2$, then
the probability is $\alpha$ that we are right, whatever the prior probability may be.⁵ It is


4 J. Neyman, Roy. Stat. Soc. Jour., Vol. 97 (1934), pp. 590-92.

5 After my lecture Dr. A. Wald called my attention to Neyman's suggestion; namely
that this statement can be generalized by admitting that the infinite sequence of $\vartheta$-values
which results from picking out successively the urns for drawing a number $x$, does not
fulfill the conditions of a kollektiv. So, instead of the terms “whatever the prior probability
may be” we can say “whatever the method of picking out the urns may be.” In
fact, let us consider the case where $\vartheta$ can assume only a finite number of values $\vartheta_1, \vartheta_2, \ldots,
\vartheta_k$. Among the $n$ first trials let $n_\kappa$ be the number of cases where $\vartheta = \vartheta_\kappa$ and $n'_\kappa \le n_\kappa$ the
number of cases where $\vartheta = \vartheta_\kappa$ and $x$ falls into the interval $x_1(\vartheta_\kappa),\ x_2(\vartheta_\kappa)$. The relative
understood that in this argument both $x$ and $\vartheta$ are variables the values of which
may change from one trial to the next. I cannot agree with the statement,
which is often made, that $x$ only is a variable and $\vartheta$ a constant or that we are
only interested in one specified value of $\vartheta$. In no way is it possible, in the
framework of the confidence limits method, to avoid the idea of a so-called
superpopulation, i.e. the existence of a manifold of urns every one of which forms
a kollektiv.⁶ Thus no contradiction and no antagonism exists between this
method and the Bayes formula. Only a different kollektiv, a two-dimensional
instead of a one-dimensional, is here considered.

I have no time to enter here into a discussion of the very interesting developments
of Neyman's theory which are intended to supply additional conditions
in order to determine the arbitrary choice of the $x$-intervals in a unique way.
May I only mention that what is called in Neyman's theory the probability of a
second type error in testing the hypothesis $\vartheta = \vartheta_0$ is given by the expression

$$\int_C^D \int_{x_1(\vartheta_0)}^{x_2(\vartheta_0)} P(\vartheta, x)\, dx\, d\vartheta = \int_C^D p_0(\vartheta)\, d\vartheta \int_{x_1(\vartheta_0)}^{x_2(\vartheta_0)} p(x \mid \vartheta)\, dx. \tag{7}$$

If we want to determine the confidence belt or the intervals $x_1, x_2$ in such a way
as to minimize this expression independently of the function $p_0(\vartheta)$, we obtain
Neyman's maximum power condition

$$\int_{x_1(\vartheta_0)}^{x_2(\vartheta_0)} p(x \mid \vartheta)\, dx = F(\vartheta, \vartheta_0) = \min. \quad \text{for each pair } \vartheta, \vartheta_0. \tag{8}$$

This condition, it is well known, cannot be fulfilled under general assumptions
for $p(x \mid \vartheta)$. Moreover the above-mentioned boundary conditions $x_1(C) =
A(C)$ and $x_2(D) = B(D)$ (or similar ones in other cases) have to be considered
too. If they are not satisfied, the statement which can be made with probability
$\alpha$ would include the prediction that certain $x$-values are impossible. Except
for this case the above formulated theorem is equally valid for every region
determined according to (5).

It is clear that if the original distribution is given by a regular, slightly varying
function $p(x \mid \vartheta)$, the confidence limits method cannot give very substantial
results. Let us take e.g. for $p(x \mid \vartheta)$ the uniform distribution

$$p(x \mid \vartheta) = 1/\vartheta \quad \text{for } 0 \le x \le \vartheta, \qquad 0 \le \vartheta \le 1. \tag{9}$$


frequency of correct predictions is then $(n'_1 + n'_2 + \cdots + n'_k) : n$ where $n$ equals
$n_1 + n_2 + \cdots + n_k$. If $n$ tends to infinity, at least one part of the $n_\kappa$ must become infinite. For those
the ratio $n'_\kappa : n_\kappa$ tends to $\alpha$ according to (5) while the other terms (with finite $n_\kappa$ and $n'_\kappa$)
have no influence. So the limiting value of the frequency $(n'_1 + n'_2 + \cdots + n'_k) : n$ equals
in any event $\alpha$. This generalization does not apply, if we ask for the probability of a second
type error of the hypothesis $\vartheta = \vartheta_0$. Here the existence of the prior probability $p_0$ is
essential.

6 According to the generalization supplied by Neyman's point of view (Phil. Trans.
Roy. Soc., Vol. A-236 (1937), pp. 333-380) which is discussed in footnote 5, the superpopulation
does not necessarily satisfy the conditions of a kollektiv.
We have here $A = 0$, $B = \vartheta$, $C = 0$, $D = 1$ and the domain in which $x$ and $\vartheta$
vary is the 45° right triangle shown in Fig. 2. Whatever $p_0(\vartheta)$ may be, the
integral of $P(\vartheta, x) = p_0(\vartheta) \cdot p(x \mid \vartheta)$ over this domain is 1 and if we omit the
part of the triangle on the left of the straight line $x = (1 - \alpha)\vartheta$, the integral
over the remaining part is $\alpha$. For $\alpha = 0.90$, a statement which can be made
with a probability of 90% reads: The value of $\vartheta$ lies between $x$ and $10x$. On
the other hand we know from the very beginning with 100% certainty that $\vartheta$
lies between $x$ and 1, so that for $x \ge 0.1$ the statement is futile. (If one chooses
as confidence belt the part on the left of the straight line $x = \alpha\vartheta$, the statement
would run: $\vartheta$ lies between $1.1x$ and 1 and values of $x$ greater than 0.9 are
impossible.) If we apply in this case the Bayes formula, we find that the outcome
depends to the highest extent on what is known about the prior probability
$p_0(\vartheta)$.
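
The independence of the 90% statement from the way the urns are picked can be illustrated by simulation. The Python sketch below rests on assumptions that are not in the text: the function name `coverage`, the three hypothetical ways of picking the parameter, and the use of a pseudo-random generator in place of a kollektiv. For the uniform distribution (9) it counts how often the prediction "the parameter lies between $x$ and $x/(1-\alpha)$" is correct.

```python
import random


def coverage(draw_theta, alpha, trials, rng):
    """Fraction of trials in which x >= (1 - alpha) * theta, i.e. in which
    the prediction 'theta lies between x and x / (1 - alpha)' is correct."""
    hits = 0
    for _ in range(trials):
        theta = draw_theta(rng)          # pick an urn
        x = rng.uniform(0.0, theta)      # draw from it: density 1/theta on [0, theta]
        if x >= (1.0 - alpha) * theta:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    # Three hypothetical ways of picking the urns (illustrative assumptions).
    pickers = {
        "uniform on (0, 1)": lambda r: r.random(),
        "crowded near 1": lambda r: r.betavariate(5.0, 1.0),
        "always the same urn": lambda r: 0.3,
    }
    for name, pick in pickers.items():
        print(name, coverage(pick, 0.90, 100_000, rng))
```

In each case the observed frequency of correct predictions is close to 0.90. The third picker always returns the same value, which corresponds to the generalization mentioned in footnote 5, where the sequence of parameter values need not form a kollektiv.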

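[Fig. 2: the triangular domain $0 \le x \le \vartheta \le 1$ with the straight line $x = (1 - \alpha)\vartheta$]

In most cases however which present themselves in practical statistics the
original density function $p(x \mid \vartheta)$ has a different character from that assumed in
(9). It depends generally on an integer $n$ and the distribution is concentrated
more and more when $n$ increases. (We may define here concentration as
standard deviation tending towards zero. The integer $n$ means in general the
number of basic experiments.) We have e.g. in the so-called Bayes problem
where $x$ is the arithmetical mean of $n$ observations the asymptotic expression
for $p$:

$$p(x \mid \vartheta) \sim \sqrt{\frac{n}{2\pi\, \vartheta(1 - \vartheta)}}\; e^{-\frac{n (x - \vartheta)^2}{2 \vartheta(1 - \vartheta)}}, \qquad 0 \le x \le 1, \quad 0 \le \vartheta \le 1. \tag{10}$$

If we denote by $\Phi$ the probability integral

$$\Phi(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-z^2}\, dz, \tag{11}$$

the $x$-intervals corresponding to a given probability value $\alpha$ are defined by

$$x_1 = \vartheta - \varepsilon, \quad x_2 = \vartheta + \varepsilon \quad \text{where} \quad \Phi\!\left(\varepsilon \sqrt{\frac{n}{2\, \vartheta(1 - \vartheta)}}\right) = \alpha. \tag{12}$$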




204 R. VON MISES 


If $n$ has a large value, the $\varepsilon$'s are very small and we get a narrow belt along the
straight line $x = \vartheta$ as shown in Fig. 3 for $\alpha = 0.90$ and $n$ about 100. The
prediction which can be made with the probability $\alpha$ reads approximately

$$x - \eta \le \vartheta \le x + \eta \quad \text{where} \quad \Phi\!\left(\eta \sqrt{\frac{n}{2\, x(1 - x)}}\right) = \alpha. \tag{13}$$

On the other hand it is well known that in this case the Bayes formula supplies
a posterior probability $p_1(\vartheta \mid x)$ which turns out to be more and more independent
of the prior probability $p_0(\vartheta)$ when $n$ increases. It has been shown that the
asymptotic expression for $p_1(\vartheta \mid x)$, whatever $p_0(\vartheta)$ may be, is

$$p_1(\vartheta \mid x) \sim \sqrt{\frac{n}{2\pi\, x(1 - x)}}\; e^{-\frac{n (\vartheta - x)^2}{2 x(1 - x)}}. \tag{14}$$

It follows that, on the basis of the Bayes formula, we can predict for every
single value of $x$ with the probability $\alpha$ that $\vartheta$ lies between the above given
limits (13). This is more than the confidence limits method supplies, but the
result is subjected to the restriction that $p_0(\vartheta)$ is a continuous function. However,
for large values of $n$ (generally this means for large numbers of basic
experiments) the outcomes of both methods are essentially the same.
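
To give an idea of the width of the belt, the half-width in (13) can be computed from the normal approximation. The Python sketch below is an illustration under two assumptions: that the function written $\Phi$ in the text is the error-function form of the probability integral, so that the condition in (13) says that a normal variable with standard deviation $\sqrt{x(1-x)/n}$ falls within the half-width of its mean with probability $\alpha$; and the function name `half_width`, which is hypothetical.

```python
from math import erf, sqrt
from statistics import NormalDist


def half_width(x: float, n: int, alpha: float) -> float:
    """Half-width of the interval (13) under the normal approximation,
    whose standard deviation is sqrt(x (1 - x) / n)."""
    z = NormalDist().inv_cdf((1.0 + alpha) / 2.0)   # two-sided normal quantile
    return z * sqrt(x * (1.0 - x) / n)


if __name__ == "__main__":
    x, n, alpha = 0.5, 100, 0.90
    eta = half_width(x, n, alpha)
    print(eta)
    # Check against the error-function form of the condition in (13).
    print(erf(eta * sqrt(n / (2.0 * x * (1.0 - x)))))
```

For $x = 0.5$, $n = 100$ and $\alpha = 0.90$ the half-width is about 0.082, so the belt is indeed narrow; the second printed value reproduces $\alpha$ when the error-function reading is used.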

Let me recapitulate in three brief sentences the essential results we have 
found in the problem of estimation. 

1. There is no contradiction of any kind between the Bayes formula and the 
confidence limits method and no difference at all in the underlying probability 
concept. In both methods the idea of a sort of “super-population” is used. 
Only two different kollektivs are considered in both cases. 

2. If the original distribution has a regular, slightly varying density function
$p(x \mid \vartheta)$, the Bayes method gives a complete answer when the prior probability
is known and no answer when it is unknown. The confidence limits method gives
in both cases a definite solution; it lies in the nature of things that the solution
cannot be very substantial if $p(x \mid \vartheta)$ is only slightly varying.

3. If the original distribution $p(x \mid \vartheta)$ depends on a further parameter $n$ and
becomes concentrated more and more with increasing $n$, both approaches give,
for large $n$, asymptotically about the same results.

It is not intended by these remarks to impair the value of the confidence
limits method which from both a theoretical and a practical point of view
deserves our attention. But the rather inconceivably aggressive attitude
towards the Bayes’ theory as displayed by a number of statisticians, which, 
however, does not include J. Neyman, turns out to be completely unfounded. 





PROBABILITY AS MEASURE 
By J. L. Doob


University of Illinois 


The following pages outline a treatment of probability suitable for statisti- 
cians and for mathematicians working in that field. No attempt will be made 
to develop a theory of probability which does not use numbers for probabilities.
The theory will be developed in such a way that the classical proofs of proba- 
bility theorems will need no change, although the reasoning used may have a 
sounder mathematical basis. It will be seen that this mathematical basis is 
highly technical, but that, as applied to simple problems, it becomes the set-up 
used by every statistician. The formal and empirical aspects of probability 
will be kept carefully separate. In this way, we hope to avoid the airy flights 
of fancy which distinguish many probability discussions and which are irrelevant 
to the problems actually encountered by either mathematician or statistician. 

We shall identify as Problem I the problem of setting up a formal calculus to 
deal with (probability) numbers. Within this discipline, once set up, the only 
problems will be mathematical. The concepts involved will be ordinary mathe- 
matical ones, constantly used in other fields. The words “probability,” 
“independent,” etc. will be given mathematical meanings, where they are used. 

We shall identify as Problem II the problem of finding a translation of the 
results of the formal calculus which makes them relevant to empirical practice. 
Using this translation, experiments may suggest new mathematical theorems. 
If so, the theorems must be stated in mathematical language, and their validity 
will be independent of the experiments which suggested them. (Of course, if a 
theorem, after translation into practical language, contradicts experience, the 
contradiction will mean that the probability calculus, or the translation, is 
inappropriate.) 

The classical probability investigators did not separate Problems I and II 
carefully, thinking of probability numbers as numbers corresponding to events 
or to hypothetical truths, and always referring the numbers back to their 
physical counterparts. The measure approach to the probability calculus has
put this approach into abstract form, and separated out the empirical elements,
thus removing all aspects of Problem II. We shall explain this approach first
in a simplified set-up, that which will be made to correspond (Problem II) to a
repeated experiment in which the result of the $n$th trial can be any integer $x_n$
between 1 and $N$ (inclusive), in which the experiments are independent of each
other, and performed under the same conditions. (The set-up will be applicable,
for example, to the repeated throwing of a die.)

The measure approach treats this experiment as follows. Let $\omega: (x_1, x_2, \ldots)$
be any sequence of integers between 1 and $N$, inclusive. We consider $\omega$ as a
point in an infinite dimensional space $\Omega$. (Each point $\omega$ may be considered as a
logically possible sequence of results of the given experiment, and this fact will
guide us in solving Problem II.) A measure function is defined on certain sets
of points of $\Omega$ as follows. Let $p_1, \ldots, p_N$ be any numbers satisfying the
conditions

$$p_j \ge 0 \quad (j = 1, \ldots, N), \qquad p_1 + \cdots + p_N = 1.$$

(How these numbers are chosen in any particular problem will be explained
below. The method of choice is irrelevant to the mathematics, but is involved
in the solution of Problem II.) The set of all sequences beginning with $x_1 = a$
is given measure $p_a$. More generally, the measure of the set of all sequences
beginning with $x_1 = a_1, \ldots, x_n = a_n$ is defined as $p_{a_1} \cdot p_{a_2} \cdots p_{a_n}$. In this
way, as can be shown,¹ a completely additive measure function is determined
on certain point sets of $\Omega$, on a field $\mathfrak{F}$ of sets so large that all the usual Lebesgue
measure and integration theory is applicable. This means that there is a collection
$\mathfrak{F}$ of sets of points of $\Omega$ such that if $S_1, S_2, \ldots$ are finitely or infinitely
many sets in the collection, their sum $\sum_{1}^{\infty} S_n$, their intersection $\prod_{1}^{\infty} S_n$, and
their complements are also in the collection. Each set $S$ in $\mathfrak{F}$ has a definite
measure $P(S)$, $0 \le P(S) \le 1$, and if $S_1, S_2, \ldots$ are finitely or infinitely many
disjunct sets in $\mathfrak{F}$,

$$P(S_1 + S_2 + \cdots) = P(S_1) + P(S_2) + \cdots.$$


Problem II, the translation problem, is solved as follows. Each relevant
event is made to correspond to a point set of $\Omega$. A relevant event is a physical
concept, defined by imposing some set $C$ of conditions on the results of the
experiments. The corresponding $\Omega$-set is the set of sequences $(x_1, x_2, \ldots)$
satisfying the same set $C$ of conditions, imposed on the $x_j$. Thus the set of all
sequences beginning with $x_1 = a_1$, $x_2 = a_2$, is made to correspond to the event:
the result of the first experiment is $a_1$, of the second is $a_2$. As is to be expected,
the mathematical picture goes further than the real one. The “event” 1 occurs
infinitely often in a sequence of trials has only conceptual significance, physically,
but the corresponding point set of $\Omega$: the set of all sequences $(x_1, x_2, \ldots)$ containing
infinitely many 1's, is a perfectly definite point set whose measure can
be calculated in terms of $p_1, \ldots, p_N$. (In fact it is easily seen that this
measure is 1 or 0, according as $p_1 > 0$ or $p_1 = 0$.) By “the probability of an
event” we shall mean the measure of the corresponding $\Omega$-set. As this measure
has been defined, the probability that the $n$th trial results in a number $j$ is $p_j$,
and the probability that one trial results in $j$, and another in $k$, is $p_j \cdot p_k$.
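
The measure of a set of sequences with a prescribed beginning is simply a product, which can be written out directly. The Python sketch below is illustrative only; the function name `cylinder_measure` and the choice of a fair die are assumptions made for the example, and exact fractions are used so that the additivity check holds without rounding.

```python
from fractions import Fraction
from math import prod


def cylinder_measure(p, prefix):
    """Measure of the set of all sequences beginning with the given prefix.

    p maps each outcome 1..N to its number p_j; prefix is (a_1, ..., a_n).
    The empty prefix describes the whole space and has measure 1.
    """
    return prod(p[a] for a in prefix)


if __name__ == "__main__":
    # Assumed example: N = 6 with equal numbers p_j = 1/6 (a fair die).
    p = {j: Fraction(1, 6) for j in range(1, 7)}
    print(cylinder_measure(p, (6, 6)))      # probability of two sixes first
    # Additivity: splitting a set by the next result leaves its measure unchanged.
    parts = sum(cylinder_measure(p, (6, 6, a)) for a in p)
    print(parts == cylinder_measure(p, (6, 6)))
```

The check in the last lines is only the finite additivity of the product rule on such sets; the extension to a completely additive measure on the whole field is the part that needs the theory cited in the footnote.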


1 Cf. A. Kolmogoroff, Ergebnisse der Mathematik, Vol. 2, No. 3, Grundbegriffe der Wahr- 
scheinlichkeitsrechnung, where the most complete treatment of the approach to the proba- 
bility calculus from the standpoint of measure is given. 
The justification of the above correspondence between events and $\Omega$-sets is
that certain mathematical theorems can be proved, filling out a picture on the 
mathematical side which seems to be an approximation to reality, or rather an 
abstraction of reality, close enough to the real picture to be helpful in prescribing 
practical rules of statistical procedure. The following two theorems are im- 
portant ones, from this point of view. These two theorems depend in no way 
on observed facts. They are stated and proved in the customary language of 
modern analysis. 

THEOREM A: Let $j_n$ be the number of the first $n$ coordinates of the point
$\omega:(x_1, x_2, \cdots)$ which are equal to $j$, where $j$ is some integer ($1 \le j \le N$) which
will be kept fixed throughout the discussion. Then $0 \le j_n \le n$, and $j_n$ varies from
point to point on $\Omega$: $j_n = j_n(\omega)$ is a function of $\omega$, that is of the sequence $(x_1, x_2, \cdots)$.
When $n \to \infty$, $j_n/n$ has not a unique limit independent of the sequence
$(x_1, x_2, \cdots)$ under consideration. In fact if $\omega$ is the point $(k, k, \cdots)$, $j_n(\omega) = 0$
for all $n$, unless $j = k$; if $\omega$ is the point $(j, j, \cdots)$, $j_n(\omega) = n$ for all $n$. It is
simple to give examples of sequences $\omega:(x_1, x_2, \cdots)$ for which $j_n(\omega)/n$ oscillates
without approaching a limit, as $n \to \infty$. But Theorem A (usually called the
strong law of large numbers) states that there is a set of sequences, i.e. an $\omega$-set $S$,
of measure 0, such that

(1) $\lim_{n \to \infty} \dfrac{j_n(\omega)}{n} = p_j$

unless $\omega$ is in $S$. In other words the sequences for which (1) is not true are
exceptional in the sense of measure theory. If a new choice $\{p'_j\}$ of $p_j$'s is made,
then if $p'_j \ne p_j$, the new exceptional set includes all the sequences which were
not exceptional before, since the limit in (1) becomes $p'_j$. Thus $S$ depends
essentially on $p_j$. Theorem A is a generalization of Bernoulli’s classical theorem
which states in our language that the measure of the set of sequences
$\omega:(x_1, x_2, \cdots)$ for which

$|j_n(\omega)/n - p_j| > \epsilon$

approaches 0, as $n \to \infty$, for any positive $\epsilon$. Theorem A is stronger because it
states that there is actual convergence, whereas Bernoulli’s theorem only concludes
that there is a kind of convergence on the average.
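
The clustering asserted by Theorem A can be watched numerically. The following is only a sketch under stated assumptions: the function name `success_ratio`, the particular $p_j$'s, and the use of a pseudo-random generator as a stand-in for "a point $\omega$ chosen according to $\Omega$-measure" are all illustrative choices, not part of the theory above.

```python
import random


def success_ratio(p, j, n, seed=0):
    """Return j_n / n for one simulated sequence x_1, ..., x_n.

    p is the list [p_1, ..., p_N]; outcomes are numbered 1..N.
    The pseudo-random generator is an assumed stand-in for a point
    omega drawn according to the product measure.
    """
    rng = random.Random(seed)
    outcomes = range(1, len(p) + 1)
    xs = rng.choices(outcomes, weights=p, k=n)
    return sum(1 for x in xs if x == j) / n


if __name__ == "__main__":
    p = [0.2, 0.5, 0.3]  # hypothetical choice of p_1, p_2, p_3
    for n in (10, 1000, 100000):
        print(n, success_ratio(p, j=2, n=n))
```

For a typical seed the printed ratios settle near $p_2$ as $n$ grows. A finite computation cannot, of course, distinguish an exceptional sequence from an unexceptional one.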

Theorem A corresponds to certain observed facts, relating to the clustering
of “success ratios,” giving rise to empirical numbers $\hat{p}_j$. If the statistician
wishes to apply his calculus to a given experiment (Problem II), he sets $p_j = \hat{p}_j$.
There has been frequent discussion of the problem of determining the $\hat{p}_j$.
This discussion of the $\hat{p}_j$ is sometimes held on so high a plane that the innocent
bystander may wonder to what purpose such abstract philosophic concepts could
possibly be put, besides that of stimulating further discussion on a still higher
plane. The principal purpose of this paper is to discuss Problem I, but a few
words on Problem II might not be out of place here. Almost everyone who is
going to use probability numbers, the $\hat{p}_j$, for other than conversational purposes,
derives them in the same way. There is a judicious mixture of experiments
with reason founded on theory and experience. Thus if a coin is tossed by an 
experimenter who has examined the coin, and found that it had heads on one
side but not on both, that it seemed balanced, and that (as a confirming check)
tossing a hundred times gave around 50 heads, the experimenter would use 1/2
as the probability of obtaining heads in his further reasoning. Of course there
is no logic compelling this. The experimenter may have been fooled. A coin 
far out of balance may turn up 50 heads in 100 throws. But man must act, 
and the above procedure has been found useful, which is all that is desired. In
many experiments, less reliance can be placed on a preliminary physical examination
of the experimental conditions, and more must be placed on the actual
working out of the experiment, as in the analysis of machine products. In that 
case, the actual results must be examined with great care, before attempting 
to use the above mathematical set-up. It sometimes may even be possible to 
change the experimental conditions to make the mathematics applicable.² In
all cases, such mathematical theorems as Theorem A and the following Theo- 
rem B give the basis for applying the formal apparatus to practice. Indeed, 
the criterion of application includes the verification of special cases of the prac- 
tical versions of Theorems A and B. 

THEOREM B: Let $f_n(x_1, \cdots, x_{n-1})$ ($n > 1$) be any function of the indicated
variables, except that we suppose $f_n$ only takes on the values 0, 1. Let $\omega:(x_1, x_2, \cdots)$
be a given point of $\Omega$. Let $n'$ be the number of the first $n$ integers $i$ such that
$f_i(x_1, \cdots, x_{i-1}) = 1$, and let $j'_n$ be the number of the first $n$ integers $i$ such that
$f_i(x_1, \cdots, x_{i-1}) = 1$, and $x_i = j$. Then $j'_n$, $n'$ are functions of $\omega:(x_1, x_2, \cdots)$.
If $f_1 = f_2 = \cdots = 1$, $j'_n = j_n$, $n' = n$, where $j_n$ is as defined above. Suppose
that there is an $\Omega$-set $S_0$ of measure 0 such that $n' \to \infty$, as $n \to \infty$, unless $\omega \in S_0$.
Theorem B states that there is then an $\Omega$-set $S'$ of measure 0, such that if
$\omega:(x_1, x_2, \cdots)$ is not in $S'$,

(1’) $\lim_{n \to \infty} \dfrac{j'_n(\omega)}{n'} = p_j$.

(The set $S'$ will depend on the given functions $f_1, f_2, \cdots$ and on the $p_j$, but is
fixed, once these have been chosen.) This mathematical theorem corresponds
to certain observed facts (usually summarized by stating that no (successful)
system of play is possible). In fact, it states, in the language of practice, that
rejecting certain trials, using as a criterion of acceptance or rejection the results
of preceding trials, rejecting the $i$th trial if $f_i(x_1, \cdots, x_{i-1}) = 0$, does not affect
the outcome of a game of chance, or, more precisely, does not affect the validity
of the physical fact corresponding to Theorem A. If $f_1 = f_2 = \cdots = 1$, (1’)
becomes (1). The hypothesis that $n' \to \infty$ as $n \to \infty$ unless $\omega \in S_0$ is made to
insure that infinitely many trials will be accepted. As an example of the


2 Cf. W. A. Shewhart, Statistical Method from the Viewpoint of Quality Control,
Washington, 1939.

possible variety in the definition of the $f_i$, we might define $f_i$ as 1 if $x_{i-1} = N$,
and $f_i = 0$ otherwise, so trials are accepted only if the previous trial resulted in
the number N. Or much more complicated systems can easily be devised in 
which the criterion of acceptance of the nth trial depends on a varying number 
of the results of preceding trials. This theorem gives a mathematical counter- 
part to the physical idea of the mutual independence of repeated trials. 
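
The particular system of play just described (accept a trial only if the previous trial resulted in $N$) can be sketched as a short simulation. The names `selected_ratio`, `accepted` and `hits` are hypothetical, the pseudo-random sequence is an assumed stand-in for a point $\omega$, and the first trial is simply treated as rejected because it has no predecessor.

```python
import random


def selected_ratio(p, j, n, seed=0):
    """Return (j'_n / n', n') for the rule f_i = 1 iff x_{i-1} = N.

    p is the list [p_1, ..., p_N]; outcomes are numbered 1..N.
    Assumption: the first trial has no predecessor and is rejected.
    """
    rng = random.Random(seed)
    N = len(p)
    xs = rng.choices(range(1, N + 1), weights=p, k=n)
    accepted = 0  # n': number of accepted trials among the first n
    hits = 0      # j'_n: accepted trials whose result equals j
    for i in range(1, n):
        if xs[i - 1] == N:  # criterion uses only the preceding result
            accepted += 1
            if xs[i] == j:
                hits += 1
    ratio = hits / accepted if accepted else float("nan")
    return ratio, accepted


if __name__ == "__main__":
    ratio, accepted = selected_ratio([0.2, 0.5, 0.3], j=2, n=300000)
    print(accepted, ratio)
```

In such a run the ratio among accepted trials stays close to $p_j$, which is what (1’) leads one to expect; the selection changes which trials are counted but not the limiting frequency.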

To summarize, mathematically (Problem I) the study has been reduced to
that of the measure properties of $\Omega$. This can be considered independently of
any physical correspondence. The physical correspondence (Problem II) makes
any event $\mathcal{E}$ correspond to a point set $E$ of $\Omega$; the “probability of $\mathcal{E}$” becomes
the measure of $E$. Thus “the probability that the result of the first experiment
is 3” becomes the measure of the set of sequences $(x_1, x_2, \cdots)$ beginning with
$x_1 = 3$. We have given no sharp definition of probability as a physical concept.
If the above mathematical set-up, after translation, using some set of $p_j$'s,
seems to fit a given physical set-up, any event will be said to have as its probability
the measure of the corresponding $\Omega$-set. We have attempted to give no
intrinsic a priori definition of the probability of an event: such a definition is
quite unnecessary for our purposes. All that was required was a basis for prescribing
the usual statistical procedures, and we have described such a basis.

In the above example, there would have been no new difficulty introduced
if the $x_n$ were not restricted to integral values, but allowed to take on any
numerical values. The general point $\omega:(x_1, x_2, \cdots)$ of $\Omega$ would now be any
sequence of real numbers. Instead of choosing the numbers $p_1, \cdots, p_N$ we
choose a “distribution function” $F(x)$, a monotone function with the following
properties:

$\lim_{x \to -\infty} F(x) = 0, \quad \lim_{x \to +\infty} F(x) = 1, \quad F(x - 0) = F(x).$

Measure on $\Omega$ is defined as follows. The set of all sequences beginning with $x_1$
such that $a \le x_1 < b$ is given measure $F(b) - F(a)$. (The number $F(b)$ is
called “the probability that $x_1 < b$.”) More generally, the measure of the set
of all sequences $(x_1, x_2, \cdots)$ beginning with $x_1, \cdots, x_n$, such that
$a_j \le x_j < b_j$, $j = 1, \cdots, n$, is defined as $\prod_{j=1}^{n} [F(b_j) - F(a_j)]$. Thus if $F(x)$ defines a
simple rectangular distribution: $F(x) = 0$ for $x < 0$, $F(x) = x$ for $0 \le x \le 1$,
$F(x) = 1$ for $x > 1$, $\Omega$-measure becomes (infinite dimensional) volume in the
(infinite dimensional) unit cube. The correspondence (Problem II) between
events and point sets of $\Omega$ is defined just as before. Sometimes it may be useful,
in considering experiments giving rise to pairs of numbers, to let each $x_n$ be a
pair of numbers so that each point $\omega$ of $\Omega$ becomes a sequence of points of a plane
instead of a sequence of points of a line. In all cases there are mathematical theorems
true of the resulting $\Omega$ which guide us (Problem II) in deciding just how the
$\Omega$-measure is to be defined, that is, how $F(x)$ is to be defined, in dealing with a
given practical problem. But the essential point is this. Once $\Omega$-measure has
been defined, no changes or further hypotheses are possible or necessary. All
relevant probability questions are answerable. Thus consider a question of the
following type: if the experiments are grouped in some way,³ with what probability
will the groups have some given regularity property?⁴ The question singles
out a set $E$ of sequences of $\Omega$ and asks: what is the measure of $E$? The problem
may or may not be difficult mathematically,⁵ depending on the grouping, but
the original definition of measure on $\Omega$ needs no enlargement to answer it.
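
As a concrete instance of such a grouping question, take the game in which trials are grouped into games decided by two wins out of three, and ask for the limiting ratio of games won by player a. The sketch below compares the closed form $p^3 + 3p^2(1 - p)$ with a simulated ratio. The function names are hypothetical, the pseudo-random trials are an assumed stand-in for a point of $\Omega$, and at least three trials are needed so that one game is completed.

```python
import random
from fractions import Fraction


def game_win_probability(p):
    """Probability that player a wins a two-out-of-three game.

    p is the probability that a wins a single trial.
    """
    return p**3 + 3 * p**2 * (1 - p)


def simulated_game_ratio(p, n_trials, seed=0):
    """Group n_trials simulated trials into games; return (games won by a)/(games played)."""
    rng = random.Random(seed)
    a_wins = b_wins = 0
    games = games_won = 0
    for _ in range(n_trials):
        if rng.random() < p:
            a_wins += 1
        else:
            b_wins += 1
        if a_wins == 2 or b_wins == 2:  # the game is decided: close the group
            games += 1
            if a_wins == 2:
                games_won += 1
            a_wins = b_wins = 0
    return games_won / games


if __name__ == "__main__":
    print(game_win_probability(Fraction(1, 3)))          # exact value
    print(simulated_game_ratio(0.6, 200000), game_win_probability(0.6))
```

Stopping a game as soon as one player has two wins gives the same winning probability as always playing three trials, which is why the single closed form serves for both descriptions of the grouping.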

Technically, the mathematics has become the mathematics of a special type
of measure defined on a space of infinitely many dimensions. If, however, there
is an integer $\nu$ such that at most $\nu$ experiments are to be considered, we
need only consider the $\nu$-dimensional space of points $(x_1, \cdots, x_\nu)$, defining
measure in this space in the same way as on $\Omega$. Thus if $x_j$ has the rectangular
distribution defined above, the measure in $(x_1, \cdots, x_\nu)$-space becomes ordinary
$\nu$-dimensional volume in the unit cube. Perhaps the most common measure a
statistician considers is that in which the measure of an $(x_1, \cdots, x_\nu)$-set $E$
becomes “the probability that the point $(x_1, \cdots, x_\nu)$ representing an independent
sample of $\nu$ from a normal distribution of mean 0 and variance $\sigma^2$
will lie in $E$”:

(2) $P\{E\} = \sigma^{-\nu} (2\pi)^{-\nu/2} \int \cdots \int_E e^{-\frac{1}{2\sigma^2} \sum_{j=1}^{\nu} x_j^2} \, dx_1 \cdots dx_\nu.$

This example makes it obvious that the statistician is always doing measure
theory, even though he may not state that fact explicitly. If the number of
experiments has no upper bound conceptually (mathematically, when the number
of dimensions $\nu$ may increase without limit, as in Theorems A, B), it is much
more convenient to use the space $\Omega$, in terms of which experiments with varying
numbers of trials can be considered simultaneously. The classical proofs of
probability theorems, such as Bernoulli’s theorem (the law of large numbers)
are perfectly correct. If the “probability of an event” is interpreted as the
measure of a set, these proofs do not even need verbal changes. There can be
no question of the need for any axiomatic development beyond that necessary
for measure theory, and the probability calculus can lead to no contradiction,
unless the theory of measure is faulty.
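
For sets $E$ of the simplest kind, boxes $a_j \le x_j < b_j$, the measure is just a product of one-dimensional differences $F(b_j) - F(a_j)$, both for the rectangular distribution and for the normal measure (2). A minimal sketch follows; the function names are illustrative, and the normal distribution function is written with the standard error function rather than taken from the text.

```python
import math


def rectangular_cdf(x):
    """F(x) = 0 for x < 0, x for 0 <= x <= 1, 1 for x > 1."""
    return min(max(x, 0.0), 1.0)


def normal_cdf(x, sigma=1.0):
    """Distribution function of a normal law with mean 0 and variance sigma**2."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def box_measure(cdf, bounds):
    """Measure of the set {a_j <= x_j < b_j for all j} under the product measure.

    bounds is a list of (a_j, b_j) pairs, one per coordinate.
    """
    measure = 1.0
    for a, b in bounds:
        measure *= cdf(b) - cdf(a)
    return measure


if __name__ == "__main__":
    # Volume of a box inside the unit square.
    print(box_measure(rectangular_cdf, [(0.0, 0.5), (0.25, 0.75)]))
    # Normal measure of the square |x_1| < 1, |x_2| < 1 with sigma = 1.
    print(box_measure(normal_cdf, [(-1.0, 1.0), (-1.0, 1.0)]))
```

The product form is exactly the independence built into the measure; for sets $E$ that are not boxes the integral in (2) has to be evaluated by other means.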

It is customary for probability theorists to stop their discussions when the 
present stage is reached, so that the beginnings of a formal calculus have been 
constructed to deal with a repetition of independent experiments, conducted 


3 A grouping is necessary, for example, when two players are playing a game in which
two out of three wins in the trials win a game. The trials are then grouped into successive
groups of two or three, depending on how they come out.

4 Continuing the preceding note, the question might be: will the ratio (games won by
player a)/(games played) approach a limit with probability 1, that is, for all of the original
sequences $\{x_n\}$ except possibly some forming a set of measure 0?

5 The answer to the question of the preceding notes is simple. If $p$ is the probability
that player a wins a trial, the ratio in question approaches $p^3 + 3p^2(1 - p)$, the probability
that a wins a game, with probability 1.

under the same conditions. Perhaps this is because of the following widely held
syllogism: probability is something dealing with random events; random events
are events having no influence on each other; therefore.... Unfortunately
mathematicians and statisticians must deal with many problems involving dependent
probabilities, whose solutions require the most delicate and careful
applications of modern analysis. The rudimentary calculi which the outsiders
find esthetically or philosophically pleasing are usually either insufferably awkward
or completely insufficient for the needs of professionals. There is a strange
situation, which one observer has facetiously described somewhat as follows: it
is true with probability 1 that the technical workers in probability use the
measure approach, but that the writers on “probability in general,” descendants
of Carlyle’s professor, do not consider this approach worth much more than a
passing remark.⁶ The following pages outline how our previous treatment is
generalized to deal with problems in which it is desirable to have the distribution
of $x_j$ vary with $j$ (so that physically the experiments are no longer the same),
and in which the $x_j$ do not have to correspond to the results of independent
experiments. Some attempt will also be made to show how the modern mathematical
theory of real functions is applied to the probability calculus.

Let $x_j = x_j(\omega)$ be the $j$th coordinate of the point $\omega:(x_1, x_2, \cdots)$. Then as
the sequence $\omega:(x_1, x_2, \cdots)$ varies, $x_j$ does also: $x_j(\omega)$ is a function of $\omega$. The
functions $x_1(\omega), x_2(\omega), \cdots$ are functions defined on $\Omega$, an abstract space on which
a measure has been defined. Moreover $\Omega$-measure has been defined in such a
way that the $\Omega$-set for which $x_j(\omega) < K$ ($j$, $K$ fixed) is an $\Omega$-set whose measure
has been defined. (This set is composed of all sequences $(x_1, x_2, \cdots)$ whose
$j$th coordinate is $< K$, and the measure is $F(K)$, using our last definition of
$\Omega$-measure.) In the terminology of measure theory, $x_j(\omega)$ is thus a measurable
function. The study of the measure relations of $\Omega$, and this is the whole of our
probability calculus, can be considered, from this point of view, as the study of
the properties of a sequence of measurable functions, one with very special
properties, as we shall see, defined on some space. A measurable function
defined on $\Omega$ is usually called a chance variable, in the theory of probability.
(This terminology is somewhat dangerous, because it mixes Problems I and II.)
The whole apparatus of modern real variable theory is applicable to these
chance variables. Thus if $f(\omega)$ is a chance variable (measurable function of $\omega$)
(physically, a function of the observations), it is customary to define a number
called its expectation. This number is simply the integral of $f(\omega)$, with respect
to the given $\Omega$-measure. The fact that the expectation of the sum of two chance
variables is the sum of their expectations is simply the familiar theorem that the
integral of the sum of two functions is the sum of their integrals. Let $S(j, K)$
be the $\Omega$-set defined by the inequality $x_j < K$. Up to now we have supposed

6 This analysis, like every other probability statement, is only an approximation to
reality, but a fairly close one.

that the measure of $S(j, K)$ is independent of $j$, that is that the distribution of $x_j$
is independent of $j$. We have also supposed that⁷

$P\{S(1, K_1) \cap \cdots \cap S(n, K_n)\} = \prod_{j=1}^{n} P\{S(j, K_j)\}$

for any positive integer $n$, and numbers $K_1, \cdots, K_n$. That is, we have supposed
that $x_1(\omega), x_2(\omega), \cdots$ are mutually independent chance variables.⁸ In
fact probability measure on $\Omega$ has been defined just to make the foregoing two
facts true. Mutual independence is a very strong hypothesis to impose on a
sequence of functions. In many probability problems (Markoff chains for
example), more general measures must be defined on $\Omega$. The sequence $x_1(\omega)$,
$x_2(\omega), \cdots$ whose properties are those of $\Omega$-measure, is then no longer a sequence
of independent functions, and the distribution of $x_j$ can vary with $j$.

At this level, the study becomes the study of any sequence of measurable
functions, defined on some space of total measure 1. If $f$, $g$ are given chance
variables, they may turn out to be independent. In that case the theorem that
the expectation of their product is the product of their expectations becomes,
when translated into mathematical language, the familiar theorem that

$\iint f(x) g(y) \, dx \, dy = \int f(x) \, dx \int g(y) \, dy.$

The mathematical theorems are not simply analogues of the probability theorems;
they themselves are those theorems. When stated mathematically, the
probability theorems need no proof: they need only recognition as standard
results.
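
On a finite space the integrals reduce to weighted sums, and the product rule for independent chance variables can be checked exactly. The sketch below assumes a space of 36 equally weighted points (two throws of a die) and two hypothetical chance variables `f` and `g`, each depending on a different coordinate; none of these choices comes from the text.

```python
from fractions import Fraction
from itertools import product

# The space: all pairs (x_1, x_2) of results of two throws, each of measure 1/36.
faces = range(1, 7)
omega_space = list(product(faces, faces))
weight = Fraction(1, 36)


def expectation(h):
    """Integral of the chance variable h with respect to the measure on the space."""
    return sum(h(w) * weight for w in omega_space)


def f(w):
    """Depends only on the first coordinate."""
    return w[0] ** 2


def g(w):
    """Depends only on the second coordinate: 1 if it is even, else 0."""
    return 1 if w[1] % 2 == 0 else 0


if __name__ == "__main__":
    lhs = expectation(lambda w: f(w) * g(w))
    rhs = expectation(f) * expectation(g)
    print(lhs, rhs)
```

Exact rational arithmetic is used so that the two sides agree identically rather than to rounding error. The additivity of expectations can be checked in the same way, and it holds whether or not the variables are independent.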

Empirical needs suggest that certain functions called conditional probability
distributions, and conditional expectations, should be defined in a certain way.
This is possible, as a formal matter,⁹ and the theorems then proved about these
functions give them their usual meaning when translated into practical language.
These functions are extremely useful tools in dealing with mutually dependent
(that is not independent) chance variables.

The above approach is easily generalized to the stage needed in the study of 
Brownian movements or of time series, in which, instead of the proper initial 


7 $P\{S\}$ was defined as the measure of the $\Omega$-set $S$.

8 The $n$ chance variables $f_1(\omega), f_2(\omega), \cdots, f_n(\omega)$ are said to be independent if for every
set of $n$ numbers $K_1, \cdots, K_n$, the following equality is true.

$P\{f_j(\omega) < K_j,\ j = 1, \cdots, n\} = \prod_{j=1}^{n} P\{f_j(\omega) < K_j\},$

where $P\{\cdots\}$ denotes the $\Omega$-measure of the $\Omega$-set defined by the conditions in the braces.
Thus in the example of a normal distribution in $\nu$ dimensions given above, $x_1, \cdots, x_\nu$
are independent functions on the space of $\nu$ dimensions, a fact which follows readily from
the fact that the $\nu$-dimensional density function is the product of $\nu$ functions of the separate
variables.

9 Cf. Kolmogoroff, loc. cit.

abstraction being a sequence $\{x_n\}$ of numbers, we have a one-parameter family
$\{x_t\}$ ($t$ takes on all real values). The number $x_t$ may, for example, be thought
of as the $x$-coordinate of a particle at time $t$. There is no difference in principle
here: $\Omega$ is now the space of functions of $t$, instead of the space of sequences, that
is functions of $n$. From the other point of view, instead of studying the properties
of a sequence of measurable functions, it becomes necessary to study the
properties of a one-parameter family of measurable functions.





DISCUSSION OF PAPERS ON PROBABILITY THEORY 
By R. von Mises and J. L. Doob


1. Comments by R. von Mises. Professor Doob outlines a new theory of 
probability starting with the following three basic conceptions. First, he uses 
the notion of an infinite sequence of trials or better: of an infinite sequence of
numbers $x_1, x_2, x_3, \cdots$ which can be considered as the outcomes of infinitely
repeated uniform experiments. Second, he introduces (in his Theorem A) the
limit of the relative frequency of a particular outcome $a$. Third, (in his Theorem B)
the notion of place selection defined by a sequence of functions
$f_n(x_1, x_2, \cdots, x_{n-1})$ is employed. All these three concepts are completely
strange to the so called classical theory as developed by Bernoulli, Laplace, 
Poisson, etc. They have been introduced and made the corner stone of proba- 
bility theory in my papers published since 1919. I daresay that in no probability 
investigation before 1919 any of those notions even were mentioned. 

This concerns what Professor Doob calls the Problem I or the purely mathe- 
matical aspect of the question. As to his Problem II or the relationship between 
the formal calculus and real facts Professor Doob stresses that the actual values 
for probabilities that enter as data into a particular argument have to be drawn 
from long, finite sequences of experiments. This is in complete accordance 
with the standpoint of my theory and in strict contradiction to the classical 
conception which knows only “a priori” probabilities determined by “equally 
likely cases.” 

In both theories, Professor Doob’s and mine (not in the classical) a mathematical
model or picture is associated with a long sequence of uniform
experiments. These models are different in both theories. My model (the
“kollektiv”) consists of one infinite sequence $\omega: x_1, x_2, x_3, \cdots$ in which the
limit of the relative frequency of each possible outcome $a$ exists and is indifferent
to a place selection; the value of this limit is called the probability of $a$.

On the other hand Professor Doob’s model implies all logically possible sequences
which form a space $\Omega$ and he shows that in this space a measure function
can be introduced which fulfills the following conditions: (1) If $m$ is a positive
integer, the set of all sequences the $m$th element of which is $a$ has a measure $p_a$
independent of $m$; (2) the set of all sequences in which the relative frequency
of $a$-results has either no limit or a limit different from $p_a$ has measure zero; (3) if $S$ is any
place selection, the set of all sequences $\omega$ for which the relative frequency of $a$
in $S(\omega)$ has either no limit or a limit different from $p_a$ likewise has measure zero; this value
$p_a$ is called the probability of the outcome $a$. It then can be shown that a
probability in this sense can be ascribed to certain events, i.e. to certain types
of experiments which in some way are connected with the sequence of basic
experiments. E.g. if the original sequence consists of the single successive
tossings of a die, the derived sequence may consist of pairs of tossings with the
sum of the outcoming points as new value of $a$. The new probabilities $p_a$ are
found as measures of certain sets in the original measure system established in $\Omega$.
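
The derived probabilities in the die example can be written out directly: the measure of the set of sequences whose first pair sums to $a$ is the sum of the products $p_b \, p_c$ over all pairs with $b + c = a$. A minimal sketch follows, with a hypothetical function name and an unbiased die assumed only for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pair_sum_probabilities(p):
    """Probabilities of the sum of a pair of independent tossings.

    p maps each face of the die to its probability; the result maps each
    possible sum to the measure of the corresponding set of pairs.
    """
    dist = Counter()
    for (b, p_b), (c, p_c) in product(p.items(), repeat=2):
        dist[b + c] += p_b * p_c
    return dict(dist)


if __name__ == "__main__":
    fair_die = {face: Fraction(1, 6) for face in range(1, 7)}  # assumed unbiased
    for total, prob in sorted(pair_sum_probabilities(fair_die).items()):
        print(total, prob)
```

The same routine accepts any assignment of probabilities to the faces, so a biased die changes only the input and not the way the new probabilities are found.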

There is no doubt that the model used by Professor Doob for representing 
empirical sequences of uniform experiments is logically consistent. Its practical 
usefulness depends on how the usual problems of combining different kollektivs 
and so on can be settled within this scheme. This has to be shown in detail. 
It seems to me that my conception is simpler in its application and closer to 
reality, while his model may be considered more satisfactory from a logical 
standpoint since it avoids the difficulties connected with the concept of “all
place selections.” At any rate, however, there is no contradiction or irreconcilable
contrast: both theories are essentially statistical or frequency theories,
equally far from the classical conception based on “equally likely cases.” In 
both theories probabilities are, of course, measures of sets. 


2. Comments by J. L. Doob. It is perhaps unfortunate that Professor von 
Mises’ treatment of probability problems, based on typical sequences (“collectives,”
“admissible numbers”), is commonly called the “frequency theory.”¹ It
is clear to any reader of our papers (identified as M and D below) that the idea 
of frequency, at least in the discussion of the relation of mathematics to prac- 
tice, is no more fundamental to one approach than to the other. In one mathe- 
matical treatment frequency notions first appear in the theorems, whereas in 
the other they first appear in the axioms; but they appear in both. The principal
objection the measure advocates have to the frequency approach is that it is 
awkward mathematically. Anyone who doubts this awkwardness need only 
examine various books published recently, using this approach, to see what a 
lot of fussy detail is involved merely in proving such elementary results as the 
Tchebycheff inequality or the Bernoulli theorem. One author considers it neces- 
sary to have his chance variables so restricted that if x is a chance variable, the 
event x < k has a probability assigned to it only if k is not in some exceptional 
set, which may be infinite. To take another example, consider the coin tossing 
game discussed in both M and D, in which two out of three wins at tosses win
a game. Apparently the probability analysis of this game is somewhat difficult
in terms of the frequency theory. As the quite elementary treatment outlined 
in D shows, there is no difficulty involved, using the measure approach. The 
question is simple: a set of chance variables is given (corresponding to the 
original tosses); a new set is determined from them (corresponding to the 
grouping into games). Only elementary algebraic manipulation is required to 
verify that the new chance variables are mutually independent in the mathe- 
matical sense, (Cf. D), and have the same distribution, so the law of large 
numbers is applicable. Professor von Mises considers that the measure theory 
cannot handle this problem. I on the other hand consider that this problem 
exhibits the mathematical disadvantages of the frequency theory. 


1 This identifying name will be used below also.

The frequency theory reduces everything to the study of sequences of mutually
independent chance variables, having a common distribution. “Probability 
theory is the study of the transformations of admissible numbers” writes Pro- 
fessor von Mises. This point of view is extremely narrow. Many problems of 
probability, say those involved in time series, can only be reduced in a most 
artificial way to the study of a sequence of mutually independent chance vari- 
ables, and the actual study is not helped by this reduction, which is merely a 
tour de force. 

It is claimed in M that the axioms of measure theory only describe the distribution
within one collective (M, p. 00). This statement seems to mean that
only the measure relations (using the notation of D) of the first coordinate
function $x_1(\omega)$ can be discussed in the measure theory, that is only probabilities
of the type: the probability that $x_1 < k$ (in the language of practice, “the
probability that the result of the first experiment is less than $k$”) are discussed.
Actually, however, (Cf. D) the measure theory can discuss any number of experiments
simultaneously, using the appropriate space $\Omega$.

Many of the debates between the advocates of the various probability theories 
have been wasted, because some of the debaters talk mathematics, others physics. 
With this in mind, I should like to stress again² that (except for a few philosophically
inclined Englishmen) everyone calculates probability numbers in the
same way—a combination of reasoning based on experience and helped by 
theory, with examination of the experimental conditions and the results of trials. 
Frequency considerations necessarily play a large part. The fact that almost 
everyone calculates probability numbers in the same way does not alter the 
fact that one mathematical theory may be more useful or convenient than 
another in dealing with these probability numbers. 

In closing, it seems proper to call attention to what the measure advocates 
consider the real services and contributions of the approach of Professor von 
Mises. Professor von Mises was the first to stress the importance of the second 
of two fundamental generalizations of experience in dealing with repeated mu- 
tually independent experiments of the same character: (1) the clustering of 
success ratios and (2) the fact that this clustering is unaffected by a system of 
rejection as described in M and D. These two generalizations of experience are 
certainly fundamental. The only point under discussion here is how such gen- 
eralizations are to be put into a mathematical setting. The original such setting 
of Professor von Mises was criticized as not really mathematical. The setting 
now proposed by Copeland and others is criticized by the measure advocates as 
mathematically inflexible and clumsy. But it is significant that even in a treat- 
ment of the measure approach, as in D, it was felt essential to stress the mathe- 
matical interpretation of the two empirical generalizations of Professor von 
Mises. In the terminology of D, the measure advocates consider the contribu- 
tion of Professor von Mises’ approach to be a contribution to a solution of 
Problem II, not to Problem I, the mathematical problem. 


2 We are not talking mathematics now, but the application of mathematics.





CONTINUED FRACTIONS FOR THE INCOMPLETE BETA FUNCTION¹


By Leo A. AROIAN 
Hunter College 


1. Introduction. Existing literature on the problem of calculating the incomplete
Beta function

(1.1) $B_x(p, q) = \int_0^x x^{p-1} (1 - x)^{q-1} \, dx, \quad 0 < x < 1,\ p > 0,\ q > 0,$

and the levels of significance of Fisher’s $z$ [1] leave further work to be done.
Müller’s continued fraction and a new continued fraction are shown to possess
complementary features covering the range of $B_x(p, q)$ for all values of $x$, $p$, $q$.
Previous methods of computing $I_x(p, q) = B_x(p, q)/B(p, q)$ are given in [2], [5],
[6], [8], [10], [13], [14], [15].

Müller’s continued fraction is

A convergent infinite series $1 + \sum d_\nu x^\nu$ can be converted into an infinite con-

where [4], [9] p. 304,

1 Presented at a meeting of the American Mathematical Society, October 28, 1939, New
York City.

The infinite continued fraction found in this manner is called the corresponding


continued fraction and the power series is said to be semi-normal if B2, ~ 0, 
Ber * 0. 


2. A new continued fraction. Müller found his continued fraction by converting
in the manner of the preceding paragraph

where $C_\nu = c_\nu x$. By well known theorems due to Van Vleck [12] and Perron
[9] p. 347 we find (1.2) converges for $-1 < x <$ , and (2.3) converges for
$-\infty < x < 1$, and in the neighborhood of zero (2.2) equals (2.3). The region
of equivalence of the series and the fraction may be extended by the following
argument. Let the infinite series be terminated at some arbitrary point which
gives the desired accuracy. Then the continued fraction of the corresponding
type represents this finite series, is finite and gives the result within the desired
accuracy. The new continued fraction may also be derived by use of the hypergeometric
series [9] p. 348. A special case of (2.3) was given by Markoff [3],
pp. 135-41, [11] pp. 53-55, who applied the result only to the binomial distribution.
The associated continued fraction provides more rapid convergence than
the corresponding continued fraction. The associated continued fraction is
found by means of the hypergeometric series [9] p. 331, p. 348:

(p + 2s)(p + 2s + 1) (p + 2s + 2)(p + 2s + 1)


The disadvantage of (2.4) lies in the unwieldy form of computation. For prop- 
erties of an associated continued fraction and the corresponding continued 
fraction in connection with convergence and the Taylor series reference is made 
to [9] p. 331 and pp. 302-303. 


3. Properties of the corresponding continued fraction. Müller and Soper
[5], [10], pointed out the inadvisability of integration through the mode
$x = \dfrac{p - 1}{p + q - 2}$
for his continued fraction that if we do not integrate through the mode (we
assume this in the remainder of the paragraph) that convergents 2, 3, 6, 7, etc.,
will be greater than the true value and the remaining convergents will be less
than the true value provided $q$ is an integer. However, if $q$ is not an integer,
and is small ($q < 20$), it may happen that all convergents are above the true
value. In such cases we may consider whether Müller’s continued fraction may
apply by estimating the remainder $I_x(p + s, q - s)$, after $s$ reductions by parts
[10].

For the new continued fraction also

In such cases we change $I_x(p, q)$ to $I_{1-x}(q, p)$. Müller has shown

and $C_{2s+1} < 0$; $C_{2s} > 0$ unless $s > q$ when $C_{2s} < 0$. If $C_{2s} > 0$ then the convergents
2, 3, 6, 7, 10, 11, etc., will be above the true value and the other convergents
will be below the true value. If $C_{2s} < 0$, then all convergents will be
above the true value. In such cases, since a remainder for the continued fraction
has not been found, it seems best to estimate $I_x(p + s, q - s)$ to obtain
an idea of the error.


4. $I_x(p+s, q-s)$ and the equivalent continued fraction. Soper [10] has 
given the remainder after $s$ reductions by raising $p$. This will furnish an upper 
bound of the error in the corresponding continued fraction after $s$ convergents. 
The remainder, when $q - s$ is a negative integer, is approximately 

$$I_x(p+s,\, q-s) \approx \cdots$$


Another approach is to use the equivalent continued fraction, for $s - 1$ 
convergents of the equivalent continued fraction reproduce exactly $s$ terms of the 
infinite series. The infinite series and the equivalent continued fraction for the 
infinite series are alike in all respects except form. By [9] p. 210, we find that 
the equivalent continued fraction for (2.3) is 


where § = 

The equivalent continued fraction for Müller's continued fraction is given in 
[5], p. 292. 


5. Numerical illustration. If $A_\nu$ and $B_\nu$ represent the numerator and the 
denominator of the $\nu$-th convergent of a continued fraction 

$$\frac{a_1}{b_1 +}\;\frac{a_2}{b_2 +}\;\frac{a_3}{b_3 +}\;\frac{a_4}{b_4 +}\cdots,$$

then 

$$A_\nu = b_\nu A_{\nu-1} + a_\nu A_{\nu-2}, \qquad B_\nu = b_\nu B_{\nu-1} + a_\nu B_{\nu-2}. \tag{5.1}$$

As an example we calculate $I_{.5}(2.5, 1.5)$, which could not be done by Müller's 
continued fraction. 
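
The recurrence (5.1) is easy to mechanize. The following is only a minimal sketch in Python, not the computation used for the table below: the function name `convergents` is hypothetical, and the starting values $A_{-1} = 1$, $A_0 = 0$, $B_{-1} = 0$, $B_0 = 1$ are the usual convention for a fraction with no leading integer term, assumed here because the text does not state them.

```python
from fractions import Fraction


def convergents(a, b):
    """Yield (A_v, B_v) for a1/(b1 + a2/(b2 + ...)) via recurrence (5.1).

    Assumed starting values: A_{-1} = 1, A_0 = 0, B_{-1} = 0, B_0 = 1.
    """
    A_prev, A_curr = 1, 0
    B_prev, B_curr = 0, 1
    for a_v, b_v in zip(a, b):
        # A_v = b_v * A_{v-1} + a_v * A_{v-2}, and likewise for B_v
        A_prev, A_curr = A_curr, b_v * A_curr + a_v * A_prev
        B_prev, B_curr = B_curr, b_v * B_curr + a_v * B_prev
        yield A_curr, B_curr


# Illustration: with all a_v = b_v = 1 the convergents are ratios of
# consecutive Fibonacci numbers.
ratios = [Fraction(A, B) for A, B in convergents([1] * 5, [1] * 5)]
```

Numerator and denominator may be rescaled by a common factor at any step without changing the ratio A/B, which is convenient when the raw values grow or shrink quickly.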





Convergent      A              B              A/B

     1          1              1              1
     2          1              .42857143      2.3333333
     3          1.015873016    .44444444      2.2857142
     4          .66233767      .29292929      2.2610838
     5          .64812966      .28671329      2.2605498
     6          .46471308      .20559441      2.2603391
     7          .441837914     .195475117     2.2603281
     8          .33105492      .14646345      2.2603245
     9          .30890766      .13666520      2.2603242
    10          .23762461      .10512856      2.2603240
    11          .21882154      .096809808     2.2603240










Using the value of the eleventh convergent we have $I_{.5}(2.5, 1.5) = .28779339$. 
Pearson [7], p. 30, gives .2877934 and Soper [10], p. 32, gives .28779341. 





















6. Discussion of the various methods. Müller's continued fraction encounters 
difficulties when q is small due to the possible divergence of the series on which 
it is based. In such cases the new continued fraction works admirably. Where 
"reduction by parts" [10] is advisable it would seem Müller's results will be 
better, while if "integration raising p" is preferable, then the new continued 
fraction would be necessary. The other methods suggested in the past lacked 
in some cases remainder terms; were in other cases too long; were feasible only 
in a limited range; or were only approximations. I am particularly indebted 
to Professor C. C. Craig under whose guidance this study was completed. 


NOTES 


This section is devoted to brief research and expository articles, notes on methodology 
and other short items. 


NOTE ON THE DISTRIBUTION OF NON-CENTRAL t 
WITH AN APPLICATION 


By Cecil C. Craig 
University of Michigan 


If we adopt the notation recently used by N. L. Johnson and B. L. Welch [1], 
non-central $t$ is defined by 

$$t = \frac{z+\delta}{\sqrt{w}},$$

in which $\delta$ is a constant and $z$ and $w$ are independent variables, $z$ being distributed 
normally about zero with unit variance and $w$ being distributed as $\chi^2/f$, in which $f$ 
is the number of degrees of freedom for $\chi^2$. 

In the paper referred to Johnson and Welch discuss some applications of 
non-central $t$ and give suitable tables calculated from the probability integral 
of the distribution of this variable. Previously tables of this probability 
integral for the purpose of calculating the power of the $t$ test had been given by 
J. Neyman [2] and Neyman and B. Tokarska [3]. 

It is the purpose of this note to call attention to a series expansion for the 
probability integral of non-central $t$ which is simple in form and in most cases 
convenient for direct calculation. As an application of some intrinsic interest 
this series is used to compute in several numerical cases the power of a test 
proposed by E. J. G. Pitman [4] based on the randomization principle. 

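If for convenience we write, 

$$\sqrt{w} = y, \qquad 0 < y < \infty,$$

we have for the joint distribution of $z + \delta$ and $y$, 

$$df(z+\delta,\, y) = \frac{f^{f/2}}{\sqrt{2\pi}\;2^{(f-2)/2}\,\Gamma(f/2)}\; e^{-\frac{1}{2}(z^2 + f y^2)}\; y^{f-1}\, dy\, dz. \tag{1}$$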


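From this, on putting $z + \delta = ty$ and expanding $e^{\delta t y}$ in series, 

$$df(t,\, y) = \frac{f^{f/2}\, e^{-\delta^2/2}}{\sqrt{2\pi}\;2^{(f-2)/2}\,\Gamma(f/2)}\; e^{-\frac{1}{2} y^2 (f + t^2)} \sum_{r=0}^{\infty} \frac{(\delta t)^r}{r!}\, y^{f+r}\, dy\, dt. \tag{2}$$

Now this series can be integrated term by term with respect to $y$ over its range 
and we have, 

$$df(t) = \frac{f^{f/2}\, e^{-\delta^2/2}}{\sqrt{\pi}\;2^{(f+1)/2}\,\Gamma(f/2)} \sum_{r=0}^{\infty} \frac{(\delta t)^r}{r!}\,\Gamma\!\left(\frac{f+r+1}{2}\right)\left(\frac{2}{f+t^2}\right)^{\frac{1}{2}(f+r+1)} dt. \tag{3}$$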


This series converges uniformly in any finite interval for $t$ and it may be 
integrated term by term over the entire range for $t$ or over any part of it. In 
particular, after some reduction, we get, 

$$P(0 < t < t_0 \mid f, \delta) = \frac{e^{-\delta^2/2}}{2} \sum_{j=0}^{\infty} \frac{(\delta/\sqrt{2})^{j}}{\Gamma(j/2+1)}\; I\!\left(\frac{j+1}{2},\, \frac{f}{2};\; \frac{t_0^2}{f+t_0^2}\right), \tag{4}$$

in which $I\!\left((j+1)/2,\, f/2;\; t_0^2/(f+t_0^2)\right)$ is the incomplete Beta-function in the 
notation of Karl Pearson. Often what is wanted is 

$$P(-t_0 \le t \le t_0) = e^{-\delta^2/2} \sum_{j=0}^{\infty} \frac{(\delta^2/2)^j}{j!}\; I\!\left(j+\tfrac{1}{2},\, \frac{f}{2};\; \frac{t_0^2}{f+t_0^2}\right). \tag{5}$$

Since the incomplete Beta-function is numerically less than unity it is seen 
that the series (4) or (5) converges rapidly for moderate values of $\delta$ such as will 
ordinarily occur in applications for small samples. The use of Pearson's tables 
of $I(p, q; x)$ will be convenient since interpolation will be required for only one 
of the three arguments. 

As an application let us consider the test proposed by Pitman in the paper 
referred to above. Two independent samples, $x_1, x_2, \ldots, x_{N_1}$ and $y_1, y_2, \ldots, y_{N_2}$, 
have been drawn and it is desired in the absence of any information about 
the two populations from which the samples came to test the hypothesis that 
they have equal means. A test based on what may be termed the principle of 
randomization for this situation has been discussed by R. A. Fisher [5] and by 
E. S. Pearson [6]. It is as follows: Let the combined sample of $N_1 + N_2$ 
observations be separated into sets of $N_1$ observations, $u_1, u_2, \ldots, u_{N_1}$, and $N_2$ 
observations, $v_1, v_2, \ldots, v_{N_2}$, in all possible ways. For each such separation 
let the numerical difference of the means, $|\bar u - \bar v|$, be the spread. Then for a 
suitably chosen $\alpha > 0$, we will reject the hypothesis of equal means if fewer than 
$100\alpha\%$ of the ${}_{N_1+N_2}C_{N_1}$ spreads exceed $|\bar x - \bar y|$, and otherwise not. It is clear 
that this test is fiducially valid independently of the populations actually sampled 
in the sense that if it be consistently followed for all such samples, the proportion 
of cases when the hypothesis is rejected when it is true will statistically 
approach $\alpha$. 
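
For very small samples the randomization test just described can be carried out by direct enumeration. The sketch below is illustrative only: the function name is hypothetical, and it counts a spread as "exceeding" the observed one when it is at least as large (so the observed separation itself is counted), which is one reading of the rule and an assumption on my part.

```python
from itertools import combinations


def randomization_proportion(x, y):
    """Proportion of separations whose spread |u_bar - v_bar| is at least
    the observed spread |x_bar - y_bar| (observed separation included)."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    total = sum(pooled)
    observed = abs(sum(x) / n1 - sum(y) / n2)
    hits = 0
    count = 0
    # Every choice of n1 positions from the pooled sample is one separation.
    for idx in combinations(range(n1 + n2), n1):
        su = sum(pooled[i] for i in idx)
        spread = abs(su / n1 - (total - su) / n2)
        count += 1
        if spread >= observed - 1e-12:  # small tolerance for rounding
            hits += 1
    return hits / count


# Two samples of three: only the two extreme separations out of 20
# reach the observed spread, so the proportion is 2/20.
p = randomization_proportion([1, 2, 3], [10, 11, 12])
```

The hypothesis of equal means would be rejected at level $\alpha$ when this proportion falls below $\alpha$. The number of separations grows as a binomial coefficient, which is why the approximation discussed next is needed for all but very small samples.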

For all but very small samples it is very tedious to calculate the ${}_{N_1+N_2}C_{N_1}$ 
spreads and Pitman in his discussion shows that for quite moderate values of 
$N_1$ and $N_2$ the quantity, 

$$w = \frac{\dfrac{N_1 N_2}{N_1+N_2}\,(\bar u-\bar v)^2}{\sum(u-\bar u)^2+\sum(v-\bar v)^2+\dfrac{N_1 N_2}{N_1+N_2}\,(\bar u-\bar v)^2},$$

has a distribution which in all but very exceptional cases is quite well 
approximated by a $B(\tfrac{1}{2}, \tfrac{1}{2}(N_1 + N_2 - 2))$-function. That is, the distribution of $w$ for 
the ${}_{N_1+N_2}C_{N_1}$ spreads may for practical purposes be found from that of $t$, by a 
simple transformation, with $N_1 + N_2 - 2$ degrees of freedom. 

It seems pertinent to make some inquiry into the power of such a test, that is, 
to make an attempt to learn something about the probability that such a test 
will fail to reject the hypothesis of equal means when it is in fact false. To do 
this it is now necessary to specify the populations which have actually been 
sampled. If we suppose that these populations are normal with equal variances 
but with unequal means which, with no loss of generality, may be taken to be $\mu$ 
and $-\mu$ respectively, the probability integral of the distribution of non-central 
$t$ will give our answer. 


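If we set 

$$w = \frac{t^2}{f + t^2},$$

we have 

$$t = \sqrt{f}\,\sqrt{w}\big/\sqrt{1-w}.$$

Also, 

$$w = \frac{\dfrac{N_1 N_2}{N_1+N_2}\,(\bar u-\bar v)^2}{f s^2 + \dfrac{N_1 N_2}{N_1+N_2}\,(\bar u-\bar v)^2},$$

in which $s^2$ is the usual estimate of the population variance $\sigma^2$ based on $f = N_1 + N_2 - 2$ 
degrees of freedom. Then 

$$t = \frac{\bar u-\bar v}{s}\sqrt{\frac{N_1 N_2}{N_1+N_2}},$$

and this is a central $t$ if $\mu = -\mu = 0$, otherwise it is non-central. In the latter 
case we write (the test is made on $\bar x - \bar y$), 

$$t = \frac{(\bar x-\bar y-2\mu)+2\mu}{s}\sqrt{\frac{N_1 N_2}{N_1+N_2}} = \frac{z+\delta}{y},$$

in which 

$$z = \frac{\bar x-\bar y-2\mu}{\sigma}\sqrt{\frac{N_1 N_2}{N_1+N_2}}, \qquad y = s/\sigma, \qquad \delta = \frac{2\mu}{\sigma}\sqrt{\frac{N_1 N_2}{N_1+N_2}}.$$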

In applying Pitman's test for a given significance level $\alpha$, one determines 
whether or not 

$$P(w > w_0) \ge \alpha,$$

$w_0$ being the value of $w$ calculated from the sample. This is equivalent to finding 

$$P(t^2 > t_0^2),$$

for the proper $f$, in which 

$$t_0^2 = \frac{f\, w_0}{1 - w_0},$$

and this can be found from an ordinary table of the probability integral of the 
$t$-distribution. 

For a numerical example let $N_1 = N_2 = 10$ so that $f = 18$. If we adopt a 5% 
significance level we have $t_0 = 2.101$ for the critical value. Let us suppose that 
$\mu/\sigma = 0.1$, and calculate the probability that the hypothesis that $\mu = 0$ will be 
rejected. We have $\delta^2/2 = 0.1$ and 

$$\frac{t_0^2}{f+t_0^2} = 0.1969.$$

Then 

$$P(t^2 < t_0^2) = e^{-0.1}\left[I(0.5, 9; 0.1969) + 0.1\, I(1.5, 9; 0.1969) + \frac{(0.1)^2}{2!}\, I(2.5, 9; 0.1969) + \cdots\right] = 0.9292.$$

Four terms of the series were enough to give this result. The probability of 
rejecting the hypothesis in this case is thus 0.0708. 
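
The arithmetic of this example can be checked without tables. The following is a rough sketch, not the author's procedure: it evaluates the incomplete Beta-function by Simpson's rule after the substitution $s = u^2$ (a choice made here only to keep the integrand bounded when the first argument is 0.5), and the function names and the number of terms are assumptions.

```python
import math


def reg_inc_beta(a, b, x, n=2000):
    """I(a, b; x) by Simpson's rule on 2*u**(2a-1)*(1-u*u)**(b-1), 0..sqrt(x).

    Assumes a >= 0.5, b >= 1 and 0 <= x < 1; n must be even.
    """
    upper = math.sqrt(x)
    h = upper / n

    def g(u):
        return 2.0 * u ** (2 * a - 1) * (1.0 - u * u) ** (b - 1)

    total = g(0.0) + g(upper)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(k * h)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (total * h / 3.0) / math.exp(log_beta)


def prob_abs_t_below(t0, f, delta, terms=30):
    """Series (5): P(-t0 <= t <= t0) for non-central t with f d.f."""
    x0 = t0 * t0 / (f + t0 * t0)
    half = delta * delta / 2.0
    total = 0.0
    weight = 1.0  # holds (delta^2/2)^j / j!
    for j in range(terms):
        total += weight * reg_inc_beta(j + 0.5, f / 2.0, x0)
        weight *= half / (j + 1)
    return math.exp(-half) * total


# The example in the text: f = 18, t0 = 2.101, delta^2/2 = 0.1.
p_accept = prob_abs_t_below(2.101, 18, math.sqrt(0.2))
power = 1.0 - p_accept
```

With $\delta = 0$ the same function returns the ordinary two-sided probability for central $t$, which gives a quick check on the integration.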

The following tables show results for $\alpha$ = 0.05 and 0.01, $\mu/\sigma$ = 0.1, 0.2, and 
0.5, and $N_1 = N_2$ = 10 and 20. 


Values of $P(t^2 > t_0^2)$ 

$N_1 = N_2 = 10$ 

   α \ μ/σ      0.1        0.2        0.5
   0.05         0.0708     0.1355     0.5621
   0.01         0.0165     0.0396     ...

$N_1 = N_2 = 20$ 

   α \ μ/σ      0.1        0.2        0.5
   0.05         0.0947     0.2345     0.8691
   0.01         0.0251     0.0862     0.6730

























In only one case was it necessary to calculate as many as ten terms of the 
corresponding series to obtain these values. 


NOTE ON AN APPLICATION OF RUNS TO QUALITY CONTROL CHARTS 


By Frederick Mosteller 


Princeton University 





In the application of statistical methods to quality control work, a customary 
procedure is to construct a control chart with control limits spaced about the 
mean such that under conditions of statistical control, or random sampling, the 
probability of an observation falling outside these limits is a given $\alpha$ (e.g., .05). 
The occurrence of a point outside these limits is taken as an indication of the 
presence of assignable causes of variation in the production line. Such a form 
of chart has been found to be of particular value in the detection of the presence 
of assignable causes of variability in the quality of manufactured product. As 
recently pointed out, however, the statistician may not only help to detect the 
presence of assignable causes, but also help to discover the causes themselves in 
the course of further research and development. For this purpose, runs of 
different kinds and of different lengths have been found useful by industrial 
statisticians. Quality control engineers have found, at least in research and 
development work, that a convenient indication of lack of control is the occur- 
rence of long runs of observations whose values lie above or below that of the 
median of the sample. For example (as will be shown below), at least one suc- 
cession of 9 or more observations above or below the median in a sample of 40 
would be taken as evidence of lack of control at the .05 level; meaning that 
under conditions of control such a run would occur in approximately 5 per cent 
of the samples. Since this type of test has been found useful by quality control 
engineers, it is perhaps desirable to discuss the mathematical basis of such tests 
of control and provide a brief table for samples of various sizes at the signifi- 
cance levels .05 and .01. 

The general distribution theory of runs of k kinds of elements, and in particular 
that of two kinds, has been thoroughly investigated by A. M. Mood.* The 
purpose of this note is to give an application of the general method to quality 
control. 

Let us consider a sample of size $2n$ drawn from a continuous distribution 
function $f(x)$. These are then arranged in the order in which they were drawn. 
We now separate the sample into two sets by considering the $n$th and $(n + 1)$st 
elements in order of magnitude, $x_n$ and $x_{n+1}$: if $x_i \le x_n$, $x_i$ will be called an $a$, and if 
$x_i \ge x_{n+1}$, $x_i$ will be called a $b$. A run of $a$'s will be defined as usual as a 
succession of $a$'s terminated at each end by the occurrence of a $b$ (with the obvious 
exceptions where the run includes the first or last element of the sample), and 

1 The use of "runs up" and "runs down" as well as runs above and below the arithmetic 
mean of a sample were briefly described in a paper by W. A. Shewhart, "Contribution of 
statistics to the science of engineering," before the Bicentennial Celebration of the 
University of Pennsylvania, September 17, 1940, to be published in the proceedings of that 
meeting. In a paper, "Mathematical statistics in mass production," presented before the 
American Mathematical Society in February, 1941, Shewhart discussed some of the 
advantages of using runs above and below the median and showed how by comparing runs of 
different types in a given problem it is often possible to fix rather definitely the source of 
trouble. The present note considers only the frequency of occurrence of "long" runs which 
are often used by research and development engineers to indicate the presence of assignable 
causes of variation. The occurrence of more than one such run in a given sequence, if 
distributed above and below the median value may also constitute valid evidence of the 
presence of more than one state of statistical control between which the phenomena may 
oscillate. The interpretation of long runs in this sense, however, is not considered in the 
present note. 


* A. M. Mood, "The distribution theory of runs," Annals of Math. Stat., Vol. 11 (1940), 
pp. 367-392. 

runs of $b$'s are defined similarly. A run of $a$'s may conveniently be called a 
run "below the median," and a run of $b$'s a run "above the median." 
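
The labelling of observations and the measurement of runs can be sketched as follows. This is an illustration under the assumptions of the text (an even number of observations and no ties); the function names are hypothetical.

```python
def label_by_median(xs):
    """Label each observation 'a' (among the n smallest) or 'b' (among the
    n largest), keeping the order in which the observations were drawn."""
    n = len(xs) // 2
    cut = sorted(xs)[n - 1]  # the nth element in order of magnitude
    return ['a' if x <= cut else 'b' for x in xs]


def longest_runs(labels):
    """Length of the longest run of a's and of b's in the sequence."""
    best = {'a': 0, 'b': 0}
    current, length = None, 0
    for lab in labels:
        length = length + 1 if lab == current else 1
        current = lab
        best[lab] = max(best[lab], length)
    return best


labels = label_by_median([5, 6, 7, 1, 2, 8, 3, 4])
runs = longest_runs(labels)
```

For the eight values shown, the sequence is b b b a a b a a, so the longest run above the median has length 3 and the longest run below has length 2.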

We shall use Mood's notation throughout, i.e., $r_{1i}$, $r_{2i}$ $(i = 1, 2, \ldots, n)$ are 
the number of runs of $a$'s and $b$'s respectively of length $i$, and $r_1$, $r_2$ are the total 
number of runs of $a$'s and $b$'s; $\left[{m \atop m_i}\right]$ will indicate a multinomial coefficient, and 
$\binom{m}{k}$ a binomial coefficient. Also we define 

$$F(r_1, r_2) = 0, \qquad |r_1 - r_2| > 1,$$
$$F(r_1, r_2) = 1, \qquad |r_1 - r_2| = 1,$$
$$F(r_1, r_2) = 2, \qquad |r_1 - r_2| = 0.$$

Then the distribution of runs of $a$'s for our case is 

$$P(r_{1i}) = \frac{\left[{r_1 \atop r_{1i}}\right]\dbinom{n+1}{r_1}}{\dbinom{2n}{n}}. \tag{1}$$

We would like to find the probability of at least one run of $s$ or more $a$'s. The 
coefficient of $x^n$ in 

$$[x + x^2 + \cdots + x^{s-1}]^{r_1} \tag{2}$$

gives the number of ways of partitioning $n$ elements into $r_1$ partitions such that 
no partition contains $s$ or more elements, and none is void. Rewriting (2) we 
have 

$$x^{r_1}\left(\frac{1-x^{s-1}}{1-x}\right)^{r_1} = x^{r_1}\,(1-x^{s-1})^{r_1}\sum_{t=0}^{\infty}\binom{r_1+t-1}{r_1-1}x^t,$$

and the coefficient of $x^n$ is just 

$$\sum_{j=0}(-1)^j\binom{r_1}{j}\binom{n-j(s-1)-1}{r_1-1}. \tag{3}$$

Then the probability that we desire, of getting at least one run of $s$ or more $a$'s 
is immediately given by 

$$P(r_{1k} \ge 1,\ k \ge s) = \frac{\displaystyle\sum_{r_1}\binom{n+1}{r_1}\left[\binom{n-1}{r_1-1} - \sum_{j=0}(-1)^j\binom{r_1}{j}\binom{n-j(s-1)-1}{r_1-1}\right]}{\dbinom{2n}{n}}.$$

Noting that when $j = 0$ in the inner summation we have just the total number 
of partitions, we get finally 

$$P(r_{1k} \ge 1,\ k \ge s) = \frac{\displaystyle\sum_{r_1}\binom{n+1}{r_1}\sum_{j=1}(-1)^{j+1}\binom{r_1}{j}\binom{n-1-j(s-1)}{r_1-1}}{\dbinom{2n}{n}}. \tag{4}$$

A similar result of course holds for the $b$'s. 

If we desire the probability of getting at least one run of $s$ or more of either 
$a$'s or $b$'s, we compute the probability of getting no runs of this type and 
subtract from unity. Expression (3) multiplied by the total number of ways of 
getting no partitions of $s$ or more $b$'s for a given $r_1$, and then summed on $r_1$ 
gives exactly the number of ways of getting no runs of either $a$'s or $b$'s as great 
as $s$. This is 

$$A = \sum_{r_1}\sum_{r_2} F(r_1, r_2)\left[\sum_{j=0}(-1)^j\binom{r_1}{j}\binom{n-1-j(s-1)}{r_1-1}\right]\left[\sum_{i=0}(-1)^i\binom{r_2}{i}\binom{n-1-i(s-1)}{r_2-1}\right], \tag{5}$$

and the probability desired is 

$$1 - \frac{A}{\dbinom{2n}{n}}. \tag{6}$$

In spite of the complex appearance of $A$, the sum can be rapidly calculated for 
any given $s$, $n$ since the calculations for the sums on $i$ and $j$ need not be duplicated. 
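
The sums above are straightforward to evaluate exactly. The sketch below is one possible arrangement, with hypothetical function names; it takes any binomial coefficient whose upper index falls below its lower index to be zero, and it obtains the one-sided probability as one minus the proportion of arrangements with no long run of $a$'s, which is algebraically the same as (4).

```python
from fractions import Fraction
from math import comb


def ways_all_runs_shorter(n, r, s):
    """Expression (3): ways to split n like elements into r non-void runs,
    each of length less than s."""
    total = 0
    for j in range(r + 1):
        top = n - 1 - j * (s - 1)
        if top < r - 1:  # remaining terms vanish
            break
        total += (-1) ** j * comb(r, j) * comb(top, r - 1)
    return total


def run_weight(r1, r2):
    """F(r1, r2): number of ways to interleave r1 runs of a's and r2 of b's."""
    d = abs(r1 - r2)
    return 2 if d == 0 else 1 if d == 1 else 0


def prob_one_side(n, s):
    """Probability of at least one run of s or more a's, as in (4)."""
    good = sum(comb(n + 1, r) * ways_all_runs_shorter(n, r, s)
               for r in range(1, n + 1))
    return 1 - Fraction(good, comb(2 * n, n))


def prob_either_side(n, s):
    """Probability of at least one run of s or more of either kind."""
    w = [0] + [ways_all_runs_shorter(n, r, s) for r in range(1, n + 1)]
    A = sum(run_weight(r1, r2) * w[r1] * w[r2]
            for r1 in range(1, n + 1) for r2 in range(1, n + 1))
    return 1 - Fraction(A, comb(2 * n, n))


def prob_each_side(n, s):
    """Probability of long runs on both sides of the median."""
    return 2 * prob_one_side(n, s) - prob_either_side(n, s)
```

For a sample of 10 ($n = 5$) and $s = 5$ these give 6/252, 10/252 and 2/252, in agreement with direct counting: five like elements can stand together in 6 positions among the other five, and two arrangements have both kinds in a single block.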

In the case of a quality control chart, we set a significance level $\alpha$ for a given $n$; 
this determines $s$, the length of run of either type necessary for significance at the 
level chosen. Suppose we are interested only in runs occurring on one side of 
the median, say above, when $\alpha = .05$, $n = 20$ (i.e., sample size equal to 40). 
We determine the least value of $s$ which will make the right hand side of 
equation (4) less than or equal to .05. It turns out that $s = 8$ for this case. This 
means that under conditions of statistical control, i.e., random sampling, one or 
more runs of length 8 or more above the median will occur in approximately 
5 per cent of samples of size 40. Naturally an identical result holds when we 
are considering only runs below the median. 

On the other hand, if under the same conditions as given above ($n = 20$, 
$\alpha = .05$), we are using as our criterion of statistical control the occurrence of 
runs of length $s$ or greater either above or below the median, we must determine 
the least value of $s$ such that $1 - A\big/\binom{2n}{n} \le .05$. This value turns out to be 9. 
In other words under conditions of statistical control at least one run of at least 9 
will occur either above or below the median in less than 5 per cent of the cases 
on the average. 

The following table gives smallest lengths of runs for .05 and .01 significance 
levels for samples of size 10, 20, 30, 40, 50. 


          Runs on one side of median      Runs on either side of median
   2n       α = .05      α = .01             α = .05      α = .01
   10          5            -                   5            -
   20          7            8                   7            8
   30          8            9                   8            9
   40          8            9                   9           10
   50          8           10                  10           11





If there is an odd number of individuals, say 2n + 1, in the sample, we would 
choose the value of the median as the dividing line for our sample and treat the 
data as if there were only 2n cases, thus ignoring the median completely. 

The following table* gives the probabilities of getting at least one run of $s$ 
or more on one side, either side, and each side of the median for samples of size 10, 
20, and 40. 

 Length          2n = 10                  2n = 20                  2n = 40
 of run     One    Either  Each      One    Either  Each      One    Either  Each
  (s)       side   side    side      side   side    side      side   side    side

   1       1.000  1.000   1.000     1.000  1.000   1.000     1.000  1.000   1.000
   2        .976   .992    .960     1.000  1.000   1.000     1.000  1.000   1.000
   3        .500   .667    .333      .870   .956    .784      .992   .999    .986
   4        .143   .230    .056      .457   .640    .274      .799   .930    .668
   5        .024   .040    .008      .178   .293    .064      .450   .650    .249
   6                                 .060   .106    .013      .207   .346    .068
   7                                 .017   .032    .002      .087   .158    .016
   8                                 .004   .007    .000      .034   .065    .004
   9                                 .001   .001    .000      .013   .025    .001
  10                                 .000   .000    .000      .005   .009    .000
  11                                                          .002   .003    .000
  12                                                          .000   .001    .000
  13                                                          .000   .000    .000


One method of computing such a table is to use expression (4) to obtain the 
probabilities on one side, and to use (6) to get probabilities for either side. 
Then the probabilities for runs on each side may be computed by using the 
relationship 


2P (one side) — P (either side) = P (each side). 


*The author is indebted to Dr. P. S. Olmstead of the Bell Telephone Laboratories 
for kindly placing this table at his disposal. Dr. Olmstead has pointed out that these 
probabilities have been found very useful in research and development work. 


TEST OF HOMOGENEITY FOR NORMAL POPULATIONS 
By G. A. BAKER 


University of California 


1. Introduction. In biological experiments it is often of interest to test 
whether or not all the subjects can be regarded as coming from the same normal 
population. If they have not come from the same normal population, usually 
the most plausible alternative is that the subjects have come from a population 
which is the combination of two or more normal populations combined in some 
proportions. The combination of normal populations is a “smooth” alternative 
to the hypothesis of a single normal population. Such non-homogeneous popu- 
lations are not the only “smooth” alternatives, of course, but are included 
among the “smooth” alternatives. If there is reason to believe that the only 
deviation from a normal population is due to non-homogeneity, then the results 
of Professor Neyman in his paper [1] are available in studying this problem. 

It is desirable not to make any hypotheses about the mean and standard 
deviation of the sampled population, but to base all computations and tests on 
the data contained in the sample. Such a viewpoint has been stressed in a 
previous paper [2] where it was shown that if the sampling is from a normal 
population, the probability of a deviation from the mean of a first sample of $n$ 
measured in terms of the standard deviation of the sample is proportional to 

$$\frac{dv}{\left(1+\dfrac{v^2}{n+1}\right)^{n/2}}. \tag{1.1}$$

The result (1.1) and Neyman's results give rise to a test of homogeneity which 
is valid for "large" samples. Empirical results show that fairly conclusive 
evidence of non-homogeneity may be obtained with samples of 100. Samples of 50 
or less may be suggestive but rarely decisive. 
















2. Development of Test. Suppose that a sample of $n + 1$ is drawn from a 
normal population. It can be regarded as being made up of a first sample of $n$ 
and a second sample of one. The value of $v$ corresponding to (1.1) can then be 
computed and its distribution function is (1.1). This partition, of course, can 
be made in $n + 1$ ways. That is, $n + 1$ values of $v$ are determined from a 
random sample of $n + 1$ from the original parent. It is true that these values 
of $v$ are not independent among themselves. The correlation between the values 
of $v$, to a first approximation at least, is of the order of $1/n$ and can be neglected 
if $n$ is "large." 

A suitable transformation as discussed in [3], [1] and elsewhere, transforms 
(1.1) into a rectangular distribution. 
If the same computations are made when the sampled population is not 
normal, then the resulting values obtained will not be rectangularly distributed. 
For instance, suppose that the sampled population is 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\left(p\, e^{-(x-a_1)^2/2\sigma^2} + q\, e^{-(x-a_2)^2/2\sigma^2}\right); \tag{2.1}$$

we find that the distribution of $v$ based on the first sample of 2 is a very 
complicated expression involving sums of exponentials and definite integrals of 
exponentials. To obtain a rectangular distribution if the sampled population is 
normal, the appropriate transformation to make is 

$$v = -\sqrt{3}\,\cot \pi u, \qquad dv = \sqrt{3}\,\pi \csc^2 \pi u\; du. \tag{2.2}$$


The resulting $u$-distribution for population (2.1) then is to be compared with 
the rectangular distribution in the interval from zero to one. 

For "large" values of $n + 1$ and for symmetrical non-homogeneous 
populations composed of two normal components, the $u$-distribution will be 
symmetrical about $u = \tfrac{1}{2}$, less than one near the ends, greater than one for values 
of $u$ moderately far from $\tfrac{1}{2}$ and less than one for values of $u$ near $\tfrac{1}{2}$. A Neyman 
[1] $\psi_k^2$ of order 4 will be necessary to detect a difference of this sort. If the 
non-homogeneous population of two components is skewed, the $u$-distribution 
will still show the same two-humped effect but may be skewed instead of 
symmetrical. A Neyman $\psi_k^2$ of order 4 should still be computed, although $\psi_3^2$ may 
be more significant. 

The test then consists of: 

(a) computing the $n + 1$ quantities 

$$v_i = \frac{x_i - \bar x_i}{s_i}, \qquad (i = 1, 2, 3, \ldots, n+1) \tag{2.3}$$

where 
$n + 1$ = number in the sample 
$x_i$ = the observed values 
$x_j$ = the observed values except $x_i$ 

$$\bar x_i = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad s_i^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar x_i)^2;$$

(b) making the transformation 

$$u_i = c\int_{-\infty}^{v_i}\frac{dv}{\left(1+\dfrac{v^2}{n+1}\right)^{n/2}}, \qquad (i = 1, 2, \ldots, n+1),$$

$c$ being the constant that makes the whole integral unity; 

(c) computing the first four $\psi_k^2$'s of Neyman's paper [1]; 

(d) comparing $\psi_k^2$ with $\psi_\epsilon^2(k)$ as found from the Incomplete Gamma Function 
Tables. 

If $n$ is large, say $n = 100$, then $u$ is given approximately by the normal 
probability integral. 

If $n$ is small, the values of $u$ are obtained from the Table 25 of Vol. 2 of 
Pearson's Tables. 

Neyman's derivation assumes that $n + 1$ is large and that the $u$'s are 
independent. In this case, if $n + 1$ is large, then the $u$'s are nearly independent, 
and hence the test is valid. The same procedure can be applied for smaller 
samples. It can not be expected that small differences from normal in the 
sampled population can be detected with small samples. Empirical results 
indicate that samples of 100 are necessary for decisive results even when the 
differences of the sampled population from a normal homogeneous population 
are large. Samples of 50 may be suggestive and in very extreme cases might be 
decisive. 
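
Steps (a) and (b) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the integral of (1.1) is evaluated here by Simpson's rule rather than taken from tables, with the normalizing constant $\sqrt{(n+1)\pi}\;\Gamma((n-1)/2)/\Gamma(n/2)$ that follows from the form of (1.1).

```python
import math


def leave_one_out_v(xs):
    """Step (a): v_i = (x_i - mean of the others) / s_i, where s_i uses the
    divisor n (the number of remaining observations)."""
    vs = []
    for i, xi in enumerate(xs):
        rest = xs[:i] + xs[i + 1:]
        n = len(rest)
        mean = sum(rest) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in rest) / n)
        vs.append((xi - mean) / s)
    return vs


def u_from_v(v, n, steps=2000):
    """Step (b): probability integral of (1.1) up to v, for n >= 2."""
    const = math.sqrt((n + 1) * math.pi) * math.exp(
        math.lgamma((n - 1) / 2) - math.lgamma(n / 2))

    def dens(t):
        return (1.0 + t * t / (n + 1)) ** (-n / 2)

    h = abs(v) / steps
    total = dens(0.0) + dens(abs(v))
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * dens(k * h)
    area = total * h / 3.0 / const
    # The density is symmetrical about zero, so integrate from 0 and shift.
    return 0.5 + math.copysign(area, v)


vs = leave_one_out_v([1.0, 2.0, 3.0, 4.0])
us = [u_from_v(v, 3) for v in vs]
```

For $n = 2$ the result can be compared with the closed form implied by $v = -\sqrt{3}\cot \pi u$, namely $u = \tfrac{1}{2} + \arctan(v/\sqrt{3})/\pi$.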


TABLE I 
Empirical Sampling Results 

                                           k = 1    k = 2    k = 3    k = 4
 ψ_k²'s for 51 from population A           .0001    .843    2.009    7.464
 ψ_k²'s for 101 from population A          .086    2.403    4.998   12.868
 ψ_k²'s for 101 from population B          .553     .927    7.472    7.485
 ψ_k²'s for 101 from normal                .017     .082    1.288    1.663
 ψ²_.05(k)'s (Neyman [1])                 3.842    5.992    7.815    9.488
 ψ²_.01(k)'s (Neyman [1])                 6.635    9.210   11.345   13.277


It is to be noted that the test makes no assumption about the parameters of 
the sampled population and does not group the data. The application of the 
test gives a unique result that does not depend on the judgment of the computer 
in any respect. In applying the usual chi-square test the computer must choose 
groupings. The choice of groupings as indicated in [5] may change the P-values 
to very different levels of significance. 


3. Empirical results. Samples of 51 and 101 from population A, of 101 from 
population B, and of 101 from a normal population, were drawn by throwing 
dice. Populations A and B are given in [4]. Population A is symmetrical and 
distinctly bimodal. Population B is weakly bimodal and strongly skewed. 

For samples from population A it is necessary to compute $\psi_4^2$. For samples 
from population B it may be sufficient to compute $\psi_3^2$. The non-homogeneity 
of the type of population A seems to be somewhat more detectable than of the 
type of population B. The sample from the normal parent shows close 
conformity with expectation. 

In applying the proposed test for homogeneity the $u$-values for small 
independent sets of data can be combined to give a much larger number of $u$-values. 


A NOTE ON THE POWER OF THE SIGN TEST 
By W. Mac Stewart 


University of Wisconsin 


1. Introduction. Let us consider a set of $N$ non-zero differences, of which $x$ 
are positive and $N - x$ are negative; and suppose that the hypothesis tested, 
$H_0$, implies, in independent sampling, that $x$ will be distributed about an 
expected value of $N/2$ in accordance with the binomial $(\tfrac{1}{2} + \tfrac{1}{2})^N$. As a quick 
test of $H_0$, we may choose to test the hypothesis $h_0$ that $x$ has the above 
probability distribution. Defining $r$ to be the smaller of $x$ and $N - x$, the test 
consists in rejecting $h_0$ and therefore $H_0$ whenever $r \le r(\epsilon, N)$, where $r(\epsilon, N)$ is 
determined by $N$ and the significance level $\epsilon$. 


2. Power of a test. In applying such a test it is of interest to know how 
frequently it will lead to a rejection of $H_0$ when $H_0$ is false and the situation $H_1$ 
implies that the probability law of $x$ is $(q + p)^N$, with $p \ne \tfrac{1}{2}$, thereby indicating 
an expectation of an unequal number of + and - differences. The 
probability of rejecting $H_0$ when $H_1$, implying $p = p_1$, is true, is termed the power of 
the test of $H_0$ relative to the alternative $H_1$.¹ Thus, from the point of view of 
experimental design the power ($P$) of the test of $H_0$ may be considered a 
function of the alternative hypothesis $H_1$, the significance level $\epsilon$, and $N$. As such, 
the following observations may be noted: 

1. The power $P_2$, for an assumed $\epsilon$, $N$, and $H_2$ implying $p = p_2$, is greater 
than or equal to the power $P_1$ for $\epsilon$, $N$ and $H_1$ implying $p = p_1$, where 
$|p_2 - .50| > |p_1 - .50|$. 

2. The power $P_2$ for an assumed $H_1$, $N$, and $\epsilon_2$ is greater than or equal to the 
power $P_1$ for $H_1$, $N$, and $\epsilon_1$, where $\epsilon_2 > \epsilon_1$. 

3. The power $P_2$ for an assumed $H_1$, $\epsilon$, and $N_2$ is greater than or equal to the 
power $P_1$ for $H_1$, $\epsilon$, and $N_1$, where $N_2 > N_1$. 

Hence, to increase the power of the test of $H_0$ relative to a particular $H_1$, 
the methods implied in observations 2 and/or 3 may be employed. However, 
if any increase in an established $\epsilon$ is undesirable, the method implied in 
observation 3 is the alternative. 


¹ For an extensive discussion of the power of a test, the reader is referred to J. Neyman 
and E. S. Pearson, Statistical Research Memoirs, Vol. 1 (1936), pp. 3-6. 
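
The critical value $r(\epsilon, N)$ and the power for a given alternative follow directly from the binomial law. The sketch below is illustrative and its function names are hypothetical. It assumes the two-sided reading of the significance level, $2P(x \le r \mid p = \tfrac{1}{2}) \le \epsilon$, and it counts rejections in both tails when computing the power, as the definition above literally requires; a tabulation that counts only the tail favoured by the alternative would give very slightly smaller values.

```python
from math import comb


def sign_test_critical_r(n, eps=0.05):
    """Largest r with 2 * P(x <= r | p = 1/2) <= eps, or None if none exists."""
    tail = 0
    best = None
    for r in range(n // 2 + 1):
        tail += comb(n, r)
        if 2 * tail / 2 ** n <= eps:
            best = r
        else:
            break
    return best


def sign_test_power(n, p, eps=0.05):
    """Probability of rejecting H0 when each difference is positive with
    probability p, i.e. P(min(x, n - x) <= r(eps, n))."""
    r = sign_test_critical_r(n, eps)
    if r is None:
        return 0.0
    q = 1.0 - p
    return sum(comb(n, x) * p ** x * q ** (n - x)
               for x in range(n + 1) if min(x, n - x) <= r)


r6 = sign_test_critical_r(6)          # 0: all six differences must agree
power6 = sign_test_power(6, 0.75)     # .75**6 + .25**6, about 0.178
r49 = sign_test_critical_r(49)        # 17
```

With five pairs no outcome is significant at the .05 level under this reading, which is why 6 is the smallest number of pairs for which a critical value exists.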


3. Explanation of table. In the interests of efficiency and economy, two 
questions then arise: (1) What is the minimum value of $N$ which, at the 
significance level $\epsilon$, will give the test of $H_0$ a power $P > \beta$ relative to a particular 
alternative hypothesis $H_1$? (2) For this minimum value of $N$ corresponding 
to $\epsilon$, what is the maximum value of $r$? Stated in another manner, the questions 
are these: (1) "What is the smallest number (min $N$) of paired samples that must 
be employed in conjunction with the Sign Test in order that the test of $H_0$, 
at the significance level $\epsilon$, shall have a power $P > \beta$ relative to an alternative 
hypothesis $H_1$?" (2) "If $x$ of these paired samples give rise to a positive 
difference, and (min $N - x$) a negative difference, and if $r$ be defined as the smaller 
of $x$ and (min $N - x$); then, what is the maximum value that $r$ may attain and 
still have the results, at the level $\epsilon$, judged significant?" 

Table I provides the answers to these questions for the significance level 
$\epsilon \le .05$; and (1) for $H_1$ implying $p = p_1$ for values of $p_1$ from .60 to .95 (and 
thereby from .40 to .05) at intervals of .05; (2) for values of $\beta$ from .05 to .95 
at intervals of .05, and also for $\beta = .99$. For example, assume that a power 
$P > .80$ relative to the alternative hypothesis $H_1$ ($p_1 = .70$) is desired. In 
Table I, the entry appearing in the column headed $H_1$ ($p_1 = .70$), and in the 
row $P > .80$ is 49,17, indicating that 49 paired samples are required, of which 
17 or less must be of one sign (+ or -) and hence 32 or more must be of the 
opposite sign in order that the results be significant at the .05 level. 

Because of the discreteness of the binomial distribution, it is impossible to 
maintain the level of significance at .05 or even arbitrarily close to that figure 
and still hold to the criterion that $N$ shall be at a minimum. For that reason, 
particularly when min $N$ is small, results significant at .05 according to Table I 
may be significant at a level $\epsilon'$ where $\epsilon'$ is considerably less than .05. In general, 
however, and in particular when min $N$ is large (greater than 50) both the 
quantities $(.05 - \epsilon')$ and $(P - \beta)$ are small. 


4. Illustrative example. Goulden* describes a simple experiment in 
identifying varieties of wheat. In this experiment, a wheat "expert" is presented 
paired grain samples of two particular varieties of wheat. The object of the 
experiment is to test the ability of the expert to differentiate between the two 
varieties by arranging the pairs so that samples of one variety are on the left, 
say, and samples of the other variety are on the right. 


*C. H. Goulden, Methods of Statistical Analysis, John Wiley and Sons, New York, 1939, 
p. 2. 

In a problem of this type, it is desirable to have a sufficiently large number, $N$, 
of paired samples in order that the following conditions be fulfilled: (1) The 
probability that a person possessing no discriminating ability pass the test 
through sheer guesswork be less than $\epsilon$; and (2) if past experience has proven 
that an expert does possess the ability to discriminate between the varieties to 
the extent of placing a proportion, $p_1$, of the pairs correctly in the long run, 
then the probability that he will pass the test be $P$. 


TABLE I 
Minimum number of paired samples and maximum values of related r 
H0: p0 = .50 
(5% level of significance, i.e., ε ≤ .05) 
(min N, max r) 

               H1       H1       H1       H1       H1       H1       H1       H1
 POWER       p1=.95   p1=.90   p1=.85   p1=.80   p1=.75   p1=.70   p1=.65   p1=.60

 0<P< .05      -        -        -        -        -        -       7,0      6,0
 P > .05       -        -        -        -        -       7,0      6,0      9,1
 P > .10       -        -        -        -       7,0      6,0      9,1     17,4
 P > .15       -        -        -       8,0      6,0      9,1     12,2     25,7
 P > .20       -        -        -       7,0     10,1     13,2     17,4     37,12
 P > .25       -        -       8,0      6,0     14,2     12,2     23,6     44,15
 P > .30       -        -       7,0     11,1      9,1     18,4     25,7     56,20
 P > .35       -        -       6,0     10,1     12,2     17,4     30,9     65,24
 P > .40       -       8,0       -       9,1     16,3     20,5     35,11    74,28
 P > .45       -       7,0     11,1       -      15,3     26,7     42,14    89,35
 P > .50       -       6,0     10,1     13,2     18,4     25,7     44,15   101,40
 P > .55       -        -       9,1     12,2     17,4     30,9     51,18   112,45
 P > .60       -        -      14,2     15,3     20,5     36,11    56,20   125,51
 P > .65      7,0     11,1     18,2     19,4     23,6     35,11    63,23   143,59
 P > .70      6,0     10,1     12,2     18,4     25,7     40,13    67,25   158,66
 P > .75       -       9,1     16,3     17,4     28,8     44,15    79,30   175,74
 P > .80       -      14,2     15,3     20,5     30,9     49,17    90,35   199,85
 P > .85     11,1     12,2     18,4     25,7     35,11    56,20   101,40   227,98
 P > .90      9,1     15,3     17,4     28,8     42,14    65,24   114,46   263,115
 P > .95     12,2     17,4     23,6     35,11    49,17    79,30   143,59   327,145
 P > .99     15,3     23,6     30,9     44,15    67,25   110,44   199,85   453,205

Under these conditions, how large an N is required, and for that N, what is 
the maximum number of pairs that may be incorrectly placed without failing 
the test? For alternative hypothesis $H_4$ ($p_1 = .75$), and for P > .90, referring 
to Table I, it is seen that 42 paired samples must be employed and not more 
than 14 may be placed incorrectly. Under the same alternative hypothesis, if 
it be required merely that P > .50 (i.e., an expert with an ability of .75 have 
better than an even chance of passing), then only 18 paired samples are necessary 
and not more than 4 may be arranged incorrectly. 

Thus, before conducting an experiment in which the Sign Test is to be employed, 
the experimenter should first decide what power the test must have relative 
to a certain alternative hypothesis; from the accompanying table he may then 
learn the minimum number of paired samples that are necessary and the related 
maximum value of r. 

If this procedure is not followed, and an experimenter employs, say, 6 paired 
samples, he may (as can be seen from the table) discover, to his dismay, that 
“experts” of ability .75 will be unrecognized more than 80% of the time. 
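
The closing remark about 6 paired samples can be checked with a short binomial computation. The following Python sketch is illustrative only and is not the author's procedure. It assumes a two-sided test, so that the significance level for a given (N, r) is twice the lower binomial tail at p = .50; this assumption is inferred from table entries such as (6, 0) and (9, 1). It also approximates the power by the lower tail alone, ignoring the negligible opposite tail. The function names `tail`, `significance`, `max_r` and `power` are hypothetical.

```python
from math import comb


def tail(N, r, q):
    """P(X <= r) for X ~ Binomial(N, q), X = number of pairs placed incorrectly."""
    return sum(comb(N, k) * q**k * (1 - q) ** (N - k) for k in range(r + 1))


def significance(N, r):
    """Assumed two-sided level of the rule 'pass if at most r pairs are wrong'."""
    return 2 * tail(N, r, 0.5)


def max_r(N, level=0.05):
    """Largest r whose assumed two-sided level stays below `level`, or None."""
    best = None
    for r in range(N + 1):
        if significance(N, r) < level:
            best = r
        else:
            break
    return best


def power(N, r, p1):
    """Chance that an expert of ability p1 passes (lower tail only)."""
    return tail(N, r, 1 - p1)


if __name__ == "__main__":
    # Ability .75, for three sample sizes that appear in the H_4 column of Table I.
    for N in (6, 18, 42):
        r = max_r(N)
        print(N, r, round(significance(N, r), 4), round(power(N, r, 0.75), 4))
```

With N = 6 the only admissible r is 0, and the power is .75 raised to the sixth power, a little under .18, which agrees with the statement that such experts go unrecognized more than 80% of the time.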


MOMENTS OF THE RATIO OF THE MEAN SQUARE SUCCESSIVE 
DIFFERENCE TO THE MEAN SQUARE DIFFERENCE IN 
SAMPLES FROM A NORMAL UNIVERSE 


By J. D. Williams 
Phoenix, Arizona 
The following result may have considerable application to trend analysis. 
The specific problem was proposed to me by R. H. Kent. 
Consider a sample $O_n: X_1, X_2, \dots, X_n$ from a normal population with zero 
mean and variance $\sigma^2$, the variates being arranged in temporal order. We seek 
the moments of the ratio of $\delta^2$ to $S^2$, where 

$$(n-1)\,\delta^2 = \sum_{j=1}^{n-1} (X_j - X_{j+1})^2 \tag{1}$$

and 

$$nS^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2. \tag{2}$$

Here $\bar{X}$ is the mean of the $X_j$. In order to simplify the algebra, we will work 
with quantities $A$ and $B$ defined by 

$$2\sigma^2 A = (n-1)\,\delta^2, \qquad 2\sigma^2 B = nS^2. \tag{3}$$

The characteristic function for the joint distribution of $A$ and $B$ is 

$$\phi(t_1, t_2) = E\left(e^{t_1 A + t_2 B}\right) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \int\!\cdots\!\int \exp\left(t_1 A + t_2 B - \frac{1}{2\sigma^2}\sum_{j=1}^{n} X_j^2\right) \prod_{j=1}^{n} dX_j, \tag{4}$$

where $t_1$ and $t_2$ are pure imaginaries. For the method of analysis which will be 
used here $t_1$ and $t_2$ will be considered as real variables. By straightforward 
methods we have 


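$$\phi(t_1, t_2) = \begin{vmatrix}
a & b & d & d & \cdots & d \\
b & c & b & d & \cdots & d \\
d & b & c & b & \cdots & d \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
d & \cdots & d & b & c & b \\
d & \cdots & d & d & b & a
\end{vmatrix}^{-1/2}, \tag{5}$$

where the determinant is of nth order and its elements are 

$$\begin{aligned}
a &= 1 - t_1 - (n-1)T, \\
b &= t_1 + T, \\
c &= 1 - 2t_1 - (n-1)T, \\
d &= T = t_2/n.
\end{aligned} \tag{6}$$
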
It can be verified that the determinant has the value 

$$\phi^{-2}(t_1, t_2) = \sum_{i=0}^{n-1} \binom{2n-i-1}{i} (-t_1)^{i} (1 - t_2)^{n-i-1}, \tag{7}$$

where the symbol $\binom{2n-i-1}{i}$ represents a binomial coefficient. From (7) we 


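find the moments $m_j$ of $A/B$ as follows: Setting 

$$t_2 = \sum_{k=1}^{j} t_{2k}, \tag{8}$$

we have 

$$m_j = \int_{-\infty}^{0}\!\cdots\!\int_{-\infty}^{0} \left[\frac{\partial^{j} \phi}{\partial t_1^{j}}\right]_{t_1 = 0} \prod_{k=1}^{j} dt_{2k} = \frac{2^{j} \left[\dfrac{\partial^{j} \phi(t_1, 0)}{\partial t_1^{j}}\right]_{t_1 = 0}}{(n-1)(n+1)\cdots(n+2j-3)}. \tag{9}$$

The result is rather unexpected, for we have established that the moments of 
$A/B$ are equal to the moments of $A$ divided by the moments of $B$. 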
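
The value of the determinant quoted in (7) can be checked with exact rational arithmetic. The Python sketch below is only an illustrative check, not part of the original derivation. It builds the nth-order matrix from the elements a, b, c, d of (6), evaluates its determinant by Gaussian elimination over fractions, and compares the result with the binomial sum in (7). The function names `build_matrix`, `exact_det` and `binomial_sum`, and the trial values of $t_1$ and $t_2$, are hypothetical choices made for illustration.

```python
from fractions import Fraction
from math import comb


def build_matrix(n, t1, t2):
    """Matrix of (5): a in the two corners, c on the rest of the diagonal,
    b next to the diagonal, d everywhere else, as defined in (6)."""
    T = t2 / n
    a = 1 - t1 - (n - 1) * T
    b = t1 + T
    c = 1 - 2 * t1 - (n - 1) * T
    d = T
    M = [[d] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = a if i in (0, n - 1) else c
        if i + 1 < n:
            M[i][i + 1] = b
            M[i + 1][i] = b
    return M


def exact_det(M):
    """Determinant by Gaussian elimination; exact when entries are Fractions."""
    M = [row[:] for row in M]
    n = len(M)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            result = -result
        result *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n):
                M[r][k] -= factor * M[col][k]
    return result


def binomial_sum(n, t1, t2):
    """Right-hand side of (7)."""
    return sum(
        comb(2 * n - i - 1, i) * (-t1) ** i * (1 - t2) ** (n - 1 - i)
        for i in range(n)
    )


if __name__ == "__main__":
    t1, t2 = Fraction(1, 7), Fraction(-2, 5)  # arbitrary trial values
    for n in range(2, 7):
        print(n, exact_det(build_matrix(n, t1, t2)) == binomial_sum(n, t1, t2))
```

For n = 2 the check can also be done by hand: the determinant is $a^2 - b^2 = (a+b)(a-b) = 1 - t_2 - 2t_1$, which is the two-term sum given by (7).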




We find the following explicit values for the first few moments $m_j$: 

$$\begin{aligned}
m_0 &= 1, \\
m_1 &= 2, \\
(n-1)(n+1)\,m_2 &= 4(n^2 + n - 3), \\
(n-1)(n+1)(n+3)\,m_3 &= 8(n^3 + 6n^2 + 2n - 21), \\
(n-1)(n+1)(n+3)(n+5)\,m_4 &= 16(n^4 + 14n^3 + 53n^2 - 8n - 231).
\end{aligned} \tag{10}$$


These are valid subject to the restriction $2n - 1 > j$, because in arriving at the 
explicit forms we have treated the binomial coefficient $\binom{k}{j}$ as if it were 
identically equal to $k(k-1)\cdots(k-j+1)/j!$. 
From (10) it is easy to pass to the moments of $R = \delta^2/S^2$. For example, we 
find the mean value and variance of $R$ to be 

$$\frac{2n}{n-1} \qquad \text{and} \qquad \frac{4n^2(n-2)}{(n+1)(n-1)^3}$$

respectively. 
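
The explicit moments and the variance of R can be reproduced with exact arithmetic. The following Python sketch is an illustration under stated assumptions, not the author's computation. It takes the determinant at $t_2 = 0$ to be the polynomial $\sum_i \binom{2n-i-1}{i}(-t_1)^i$, expands its power $-1/2$ as a series in $t_1$ using a standard recurrence for powers of a polynomial (a technique chosen here for convenience), and divides the resulting moments of A by $(n-1)(n+1)\cdots(n+2j-3)/2^j$. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def determinant_coeffs(n):
    """Coefficients p_i of t1**i in the determinant when t2 = 0."""
    return [Fraction((-1) ** i * comb(2 * n - i - 1, i)) for i in range(n)]


def moments_of_A(n, jmax):
    """E(A**j) for j = 0..jmax, from the series of (determinant)**(-1/2)."""
    p = determinant_coeffs(n)
    alpha = Fraction(-1, 2)
    f = [Fraction(1)]  # series coefficients of p(t)**alpha, with p_0 = 1
    for k in range(1, jmax + 1):
        s = Fraction(0)
        for i in range(1, min(k, n - 1) + 1):
            s += (alpha * i - (k - i)) * p[i] * f[k - i]
        f.append(s / k)
    return [factorial(j) * f[j] for j in range(jmax + 1)]


def denominator(n, j):
    """(n-1)(n+1)...(n+2j-3), the empty product being 1."""
    d = 1
    for k in range(j):
        d *= n - 1 + 2 * k
    return d


def ratio_moment(n, j):
    """m_j: 2**j times E(A**j), divided by (n-1)(n+1)...(n+2j-3)."""
    return Fraction(2) ** j * moments_of_A(n, j)[j] / denominator(n, j)


def explicit_moment(n, j):
    """The explicit forms listed for m_0 through m_4."""
    numer = {
        0: 1,
        1: 2 * (n - 1),
        2: 4 * (n**2 + n - 3),
        3: 8 * (n**3 + 6 * n**2 + 2 * n - 21),
        4: 16 * (n**4 + 14 * n**3 + 53 * n**2 - 8 * n - 231),
    }[j]
    return Fraction(numer, denominator(n, j))


def variance_of_R(n):
    """Variance of R = (n/(n-1)) * A/B, from m_1 and m_2."""
    c = Fraction(n, n - 1)
    return c * c * (ratio_moment(n, 2) - ratio_moment(n, 1) ** 2)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, [ratio_moment(n, j) == explicit_moment(n, j) for j in range(5)])
        print(n, variance_of_R(n), Fraction(4 * n * n * (n - 2), (n + 1) * (n - 1) ** 3))
```

Because the sketch uses the true binomial coefficients, it needs no restriction on j; the explicit polynomial forms are only compared here for n of 3 or more, where the restriction $2n - 1 > j$ holds for every j up to 4.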











